Diabetes prediction on women using data mining technique



DIABETES PREDICTION ON WOMEN USING DATA MINING TECHNIQUE

Students: Bui Thi Anh Nguyet, Vu Thi Lieu, Le Thi Bich Ngoc
Supervisor: MSc. Bùi Quốc Khánh

Abstract: Diabetes is a chronic ailment characterized by excessively high blood sugar levels, and it poses serious threats to human health. With the strong growth of data science in general and machine learning in particular, diverse techniques have been employed in many aspects of medical care, helping to extract valuable information. The objective of this study is to build a model that can predict the probability of diabetes in women. The Pima Indian Diabetes Database (PIDD), obtained from the UCI repository, was used for the analysis. In this research, three machine learning classification algorithms, namely Logistic Regression, Decision Tree, and Random Forest, were applied. The performance of all three algorithms was evaluated on various metrics. The final results show that Random Forest gives the best predictions of the three algorithms when all attributes are used.
Keywords: Decision Tree, Google Colab, Logistic Regression, Pima Indian Diabetes Database (PIDD), PyCharm, Random Forest

I. INTRODUCTION

No longer confined to rich countries, diabetes has become a common disease worldwide. According to 2014 statistics from the World Health Organization (WHO), diabetes affects 422 million people globally. Without increased awareness and timely intervention, diabetes will become the seventh leading cause of death by 2030. The disease causes 3.7 million deaths each year. According to a study in the journal Annals of Internal Medicine, mortality rates among women with diabetes remained alarming between 1971 and 2000; in addition, the difference in mortality between women with and without diabetes more than doubled over that period. Diabetes in women differs from diabetes in men for several reasons. First, women are often less likely to be treated for cardiovascular risk factors and diabetes-related disorders. Second, the complications of diabetes in women are harder to diagnose. Third, women often develop different types of heart disease than
men. Moreover, hormones and inflammation also behave very differently in women. Our team therefore decided to analyze a women's diabetes data set to gain useful information that can help prevent this dangerous illness in women. A number of studies have been carried out with different model classifiers and have led to valuable results. In this paper, our team applied three classification algorithms, namely Logistic Regression, Decision Tree, and Random Forest, to train and test on the diabetes data set and compare the strengths of each one. In addition, an application was created to visualize the results of the research and let users obtain a diabetes prediction from each classifier. These results would also be useful for classifying data on diabetes complications. They are discussed in the following sections.

II. METHODOLOGY

In this research, we used three kinds of models to predict the risk of diabetes in women: Logistic Regression, Decision Tree, and Random Forest.

Logistic Regression Model: Logistic Regression is a popular statistical technique for predicting categorical outcomes (binomial or multinomial). Its predictions take the form of probabilities that an event will occur. The purpose of logistic regression is to find the best-fitting model describing the relationship between a dichotomous characteristic of interest (the dependent variable) and a set of independent (predictor or explanatory) variables. Logistic Regression performs well when the data set is linearly separable. It is less prone to over-fitting, but can over-fit on high-dimensional data sets; regularization can be used to avoid this. In addition, logistic regression is easy to implement and efficient to train.

Decision Tree Model: Decision Tree is one of the most popular supervised learning algorithms in data mining classification because it is easy to use and understand.
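The Logistic Regression model above can be sketched with scikit-learn (the library this research uses for modeling). The tiny data set below is synthetic, standing in for two PIDD-like features (glucose and BMI); it is an illustration, not the paper's actual training code.

```python
# Minimal Logistic Regression sketch with scikit-learn.
# The data below is synthetic, for illustration only.
from sklearn.linear_model import LogisticRegression

# Toy features (e.g. glucose, BMI) and binary outcome (1 = diabetes).
X = [[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1],
     [137, 43.1], [116, 25.6], [78, 31.0], [115, 35.3]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns [p(non-diabetes), p(diabetes)] for a new patient,
# which is the probabilistic output described above.
probs = model.predict_proba([[150, 34.0]])
print(probs)
print(model.predict([[150, 34.0]]))
```

The probabilistic output is what distinguishes logistic regression from a hard classifier: a clinician can read the predicted risk directly rather than only a yes/no label.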
A decision tree is the building block of a random forest and is an intuitive model. One can think of a Decision Tree as a series of yes/no questions (the nodes) asked of the data until the algorithm obtains a model that can make predictions on similar data. However, the Decision Tree is prone to over-fitting, which affects the accuracy of the model. In a Decision Tree, over-fitting occurs when the tree is trained to match the training data so closely that it achieves high accuracy there, but the accuracy drops on other data sets. There are two ways to avoid over-fitting in decision trees: pruning and random forests. Pruning is a regularization technique that avoids over-fitting in decision trees in general. The second way is to combine many decision trees into a single ensemble model known as the random forest.
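The pruning idea can be sketched in scikit-learn by capping tree depth (a minimal illustration on synthetic data; `max_depth=2` mirrors the value this paper later reports for its Decision Tree):

```python
# Sketch: an unconstrained decision tree vs. a depth-limited (pre-pruned) one.
# Synthetic data, for illustration only.
from sklearn.tree import DecisionTreeClassifier

X = [[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1],
     [137, 43.1], [116, 25.6], [78, 31.0], [115, 35.3]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# An unconstrained tree keeps splitting until it fits the training data perfectly,
# which is exactly the over-fitting behaviour described above.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Restricting max_depth is a simple form of pruning that limits this behaviour.
pruned_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(deep_tree.get_depth(), pruned_tree.get_depth())
```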
Random Forest Model: The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree [1]. Random Forest takes a different approach from a single Decision Tree: it is a collection of many decision trees, and it treats each tree as an independent voter (as in a real election). Each tree casts a vote for a class, and the class receiving the most votes becomes the forest's prediction.

III. IMPLEMENTATION

A. Data Resource

The data was gathered and made accessible by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. A few constraints were placed on the selection of these instances from a larger database: all patients are female, at least 21 years old, and of Pima Indian heritage (a subgroup of Native Americans). There are 768 women in the data set, with 8 features (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) and one target label called Outcome, which indicates whether the woman has diabetes or not.

B. Data Preparation

The data has been cleaned, and 768 cases remain after processing. There is a clear imbalance in the outcomes: of the 768 cases, 500 are non-diabetic and 268 have diabetes. This division is shown in the figure below:

Figure 1: The difference between diabetes (1) and no diabetes (0)

After exploring the data, we split the data into two sets: the training data
set and the test data set in a 7:3 ratio, meaning 70% of the data is used for training and the remaining 30% for testing.

C. Software Used

Our team used Google Colab and PyCharm as the main software for this research because of their great benefits.

Google Colab: Google Colab is a free cloud service that now supports free GPUs [2]. For those studying and researching deep learning, GPUs are extremely useful, and instead of spending money on one we can use a GPU for free with Google Colab, which is a general-purpose tool for data science. Google Colab allows users to develop deep learning applications with popular libraries such as PyTorch, TensorFlow, Keras, or OpenCV. It is a free browser-based environment that lets users process large amounts of data and train machine learning models at no cost. It can also import many libraries, run algorithms easily, and return fast, accurate results. Given these benefits, we used Colab as the tool to process the data, train and test the models, and plot results with the scikit-learn library, giving the most intuitive and clear view of the results for each model.

PyCharm: To write and run programs in Python, we need to install a Python programming environment, and we can add an IDE to help us code faster. PyCharm, an IDE developed by JetBrains, is a perfect choice. PyCharm offers many smart features such as code completion, indentation, error checking, and accurate completion suggestions. This IDE also supports extremely powerful libraries and frameworks, and almost all the libraries needed for Python programming can be found there. In this research, we created a simple website using Flask to predict a woman's risk of diabetes, and PyCharm's features made it the right choice for building this website.

D.
Data Analysis

After splitting the data set into training and test sets and selecting the models, we trained the three models, Logistic Regression, Decision Tree, and Random Forest, on the diabetes training set. In addition, we used the Grid Search algorithm to find the parameters that make each model give its best results. To determine whether a model is really good, we calculated various metrics, including accuracy, the confusion matrix, precision, recall, and F1-score, and the results we obtained were quite high.

Accuracy: The simplest and most commonly used metric is accuracy. This evaluation
method simply calculates the ratio between the number of correctly predicted points and the total number of points in the test set. In the diabetes data set, we have 230 cases set aside as test data. The accuracy obtained on the test set by Logistic Regression, Decision Tree, and Random Forest is 0.788, 0.744, and 0.788 respectively (on a scale from 0 to 1), which means the models are 78.8%, 74.4%, and 78.8% accurate on the test data. Looking at these results, we can easily see that Random Forest and Logistic Regression have the same accuracy score, and both are more accurate than the Decision Tree.

Confusion Matrix: Accuracy alone only tells us what percentage of the data is classified correctly, without specifying how each class is classified, which class is classified best, and which class is most often misclassified as another. To evaluate this, we use the confusion matrix. Basically, a confusion matrix shows how many data points actually belong to a class and into which class they are predicted to fall [3]. The table below is a normalized confusion matrix showing what percentage of instances are assigned to the correct diabetes and non-diabetes classes, and what percentage are assigned to the wrong class:

Logistic Regression | Positive (1) | Negative (0)
Positive (1)        | 0.59         | 0.41
Negative (0)        | 0.11         | 0.89

Table 1: Confusion Matrix of Logistic Regression

Looking at the table above, for the non-diabetes class (0) we can see that 89% of the instances are categorized into the non-diabetes class, while 11% of the instances are assigned to the wrong diabetes class.
As for the diabetes class (1), only 59% of the cases are assigned to the correct diabetes class, while up to 41% of the instances are assigned to the non-diabetes class, which is a fairly high error rate. With the Decision Tree, the percentages assigned to the wrong class are moderate. Those statistics are shown in the table below:

Decision Tree | Positive (1) | Negative (0)
Positive (1)  | 0.63         | 0.37
Negative (0)  | 0.19         | 0.81

Table 2: Confusion Matrix of Decision Tree
For the non-diabetes class (0), 81% of the instances are assigned to the correct class, while the remaining 19% are assigned to the wrong diabetes class. As for the diabetes class (1), only 63% of the instances are assigned to the correct diabetes class, but the 37% assigned to the wrong non-diabetes class is smaller than for Logistic Regression.

Random Forest | Positive (1) | Negative (0)
Positive (1)  | 0.64         | 0.36
Negative (0)  | 0.13         | 0.87

Table 3: Confusion Matrix of Random Forest

Looking at Table 3 above, we can see that for the Random Forest, the percentages of instances assigned to the correct non-diabetes class and to the wrong diabetes class fall between those of the Logistic Regression and Decision Tree models. For the non-diabetes class (0), 87% of the instances are assigned to the right class and 13% to the wrong diabetes class, slightly better than the Decision Tree but slightly worse than the Logistic Regression. As for the diabetes class (1), 64% of the instances are assigned to the correct class, while the remaining 36% are assigned to the non-diabetes class. Compared to the 41% misassigned by Logistic Regression and the 37% by the Decision Tree, this is only a marginal improvement.

Precision-Recall: For classification problems where the class distribution is very imbalanced, an effective measure that is often used is Precision-Recall. In the diabetes data set, the difference in outcome counts between diabetes and non-diabetes is quite large (Figure 1), so we used Precision-Recall to analyze the data. For this data set, we consider diabetes the positive class and non-diabetes the negative class. Having identified the positive class, Precision is defined as the ratio of true positive points to all points classified as positive (TP + FP), and Recall is defined as the ratio of true positive points to all points that are actually positive (TP + FN).
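As a concrete illustration of these definitions, a small pure-Python sketch (the TP/FP/FN counts are hypothetical, not taken from the tables above):

```python
# Precision and recall computed from hypothetical counts
# (diabetes = positive class, as in this data set).
tp = 50   # diabetic women correctly flagged as diabetic
fp = 15   # non-diabetic women wrongly flagged as diabetic
fn = 17   # diabetic women the model missed

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are found

print(round(precision, 3), round(recall, 3))
```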
High precision means that the points identified as positive are mostly correct. High recall means that the True Positive Rate is high, i.e. that few positive points are missed. Returning to the diabetes data set, the three models give different precision and recall results. First, with Logistic Regression, the precision score and recall score are 0.776 and 0.743 respectively. With the Decision Tree, these numbers decrease significantly, to a precision score of 0.719 and a recall score of 0.718. With the last model, the Random Forest, the two numbers differ slightly: its precision score of 0.769 is slightly smaller than that of Logistic Regression but much larger than that of the Decision Tree, while its recall score of 0.754 is the highest of the three models. However, calculating the precision
score and recall score alone is not enough. A good classification model is one with both Precision and Recall as high as possible, as close to 1 as possible. There are two ways to measure the quality of a classifier based on Precision and Recall: the Precision-Recall curve and the F1-score.

F1-score: The F1-score is defined as the weighted harmonic mean of the test's precision and recall [4]. This score is calculated according to the formula:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

The F1-score rises with both the precision score and the recall score: F1 is high when both precision and recall are high, and the higher the F1, the better the classifier. On the diabetes data set, the three models give different F1-scores. The Random Forest has the highest F1-score at 0.760, followed by the Logistic Regression at 0.753, and finally the Decision Tree at 0.718.

IV. DISCUSSION

After applying Grid Search, we found that Logistic Regression returns its best results with 'C' = 38 and Decision Tree with 'max_depth' = 2, while Random Forest achieved its best results with 'random_state' = 0, for which we did not use Grid Search. The table below compares the metrics of all three models, giving us a basis for judging which model is best and most likely to predict accurately:

Metrics   | Logistic Regression | Decision Tree | Random Forest
Accuracy  | 0.788               | 0.744         | 0.788
Precision | 0.776               | 0.719         | 0.769
Recall    | 0.743               | 0.718         | 0.754
F1-score  | 0.753               | 0.718         | 0.760

Table 4: Performance of the proposed algorithms on the testing data set

Looking at the table above, we get the most intuitive view of the effectiveness of each model. The Random Forest achieves the highest recall and F1-score and ties Logistic Regression for the highest accuracy, while the Decision Tree gives the lowest results of the three models tested.
Through the table above, we can judge which model classifies more effectively, and whether a model is reliable, based on the metrics we have examined.
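The pipeline described above (the 7:3 split, the Grid-Search-selected parameters, and the four metrics) can be sketched end to end with scikit-learn. The data here is synthetic, standing in for the PIDD file, which readers would load themselves; C=38, max_depth=2, and random_state=0 are the values reported in the Discussion.

```python
# End-to-end sketch of the paper's pipeline on synthetic stand-in data:
# 70/30 split, the three classifiers with the reported hyper-parameters,
# and the four evaluation metrics. Replace X, y with the real PIDD
# features and Outcome column.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in for the 768-row, 8-feature PIDD data set.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# The paper's 7:3 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(C=38, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, metrics)
```

On the synthetic data the exact scores will of course differ from Table 4; the point of the sketch is the structure of the comparison, not the numbers.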
Our team also tested the effectiveness of all three models with a simple website written in Python using the Flask web framework, and the results are quite consistent with our earlier analysis.

Figure 2: Diabetes Prediction Website

The figure above shows our main website interface. It asks the user to input the information corresponding to the features in the diabetes data set, and when the user clicks the Predict button, the prediction results show whether she is likely to have diabetes. We tried a row from the diabetes data set with the 8 features Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age set to the following values: 8, 124, 76, 24, 600, 28.7, 0.687, 52. The Logistic Regression model predicts that this person is not likely to have diabetes, the Decision Tree returns the same result, while the Random Forest model predicts that the person is at risk of diabetes. Comparing these with the actual value in the data set, only the Random Forest gives the correct prediction. This corresponds with our analysis: according to Table 4, the Random Forest is the best of the three models, so the website's results match what our team has studied. We continued testing the three models on some other instances, shown in the following figure:
Figure 3: Testing instances

Looking at Figure 3 above, the Random Forest gives the correct prediction for all four instances tested, which demonstrates that it is the most effective model, ahead of Logistic Regression and Decision Tree. We also tested the three models on many other instances, and on average the Random Forest gave the most correct predictions, covering both diabetes and non-diabetes cases. This is consistent with our final analysis: the Random Forest gives the highest model-evaluation metrics, followed by the Logistic Regression and finally the Decision Tree.

V. CONCLUSION AND FUTURE WORK

As a chronic disease, diabetes is considered the culprit behind numerous deaths, which prompts the need to anticipate the probability of this ailment. The use of three machine learning algorithms is conducive to predicting the risk of diabetes. By comparing the results of the three classifiers, it can be concluded that Random Forest is the optimal solution among the three algorithms. It is hoped that this research will support doctors in diagnosing and treating diabetic women.