Analysis and Prediction of Heart Disease Using Machine Learning and Data Mining Techniques

In clinical science, predicting heart disease is one of the most difficult tasks. Today, heart disease is a major cause of morbidity and mortality in modern society. Heart disease is a term that covers a large number of conditions related to the heart. Clinical diagnosis is an important yet complicated task that must be performed accurately, efficiently, and precisely. Although significant progress has been made in the diagnosis and treatment of heart disease, further research is needed. The availability of large amounts of clinical data creates the need for powerful data analysis tools to extract useful knowledge. Heart disease diagnosis is one of the applications where data mining and machine learning tools have proven successful. This study used the machine learning algorithms KNN, Naïve Bayes, Random Forest, Logistic Regression, Support Vector Machine, J48, and Decision Tree in the WEKA software to identify which method provides the best performance and accuracy. Using these algorithms in WEKA, we built an ensemble (Vote) hybrid model by combining the individual methods. Our research aims to assess the effectiveness of various machine learning algorithms for diagnosing heart disease and to find the most suitable algorithm for this task.

demonstrated that the prevalence of hypertension in adults was around 20-25%, followed by ischemic heart disease in adults (10%), rheumatic heart disease (1.2 per thousand), and congenital heart disease (8 per thousand newborns) [2].
Physical inactivity, tobacco use, sodium intake, hypertension, diabetes, obesity, and air pollution are major risk factors for cardiovascular disease (CVD), and these risk factors are rising in Bangladesh [3]. According to the National Health Bulletin, cardiovascular disease is the leading cause of mortality, morbidity, and hospital admission in the country [4]. Russia has the highest rate of heart disease. In Russia, CVD is a major health concern, with 57% of all deaths in the country resulting from CVD. Mississippi is the state with the highest death rate from heart disease, at 233.1 per 100,000 population [5]. One study of coronary disease was conducted to examine coronary risk factors and disease prevalence among Indians, Pakistanis, and Bangladeshis, and among all South Asians and Europeans. The researchers analyzed the data using SPSS/PC+ version 6 [6]. Another study aimed to determine penicillin compliance among rheumatic fever patients in an NCCRF/HD referral hospital in Dhaka, Bangladesh. A structured cross-sectional study was conducted among 160 patients from a selected NCCRF/HD hospital in Dhaka. Data were collected through face-to-face interviews using a standard structured questionnaire [7]. The Himachal Pradesh-Rheumatic Fever/Rheumatic Heart Disease (HP-RF/RHD) Registry database of 1918 patients was analyzed. Atrial fibrillation (AF) was diagnosed with a 12-lead ECG. The association of AF with the nature and severity of valvular dysfunction was analyzed using a multivariable logistic regression model, and the strength of association was reported as an odds ratio (OR) with 95% confidence intervals (C.I.) [8]. The statistical analyses generally used multiple logistic regression analysis for individually matched case-control studies. The statistical package STATA (version 5.0) was used.
The dependent variable was the presence of ischemic heart disease, and the primary variable was depression before the diagnosis or pseudo-diagnosis date [9]. In another study, the researchers used the STATA/SE statistical software, version 11.2. The z test was used for combined estimates, covariates, and anomalies, the t-test for small-study effects, and the chi-square test for heterogeneity [10].

Methods and Materials
We conducted our experiments with the WEKA tool, version 3.8.3. In the present study, we used nine machine learning classification algorithms, which are described below.

Random Forest Algorithm
Random forest is a flexible, easy-to-use machine learning algorithm that produces good results most of the time. It is one of the most widely used algorithms; because of its simplicity and versatility, it can be used for both classification and regression tasks. Random forest is also used in e-commerce to predict whether a customer will like a product.

K-Nearest Neighbors (KNN) Algorithm
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN works by computing the distances between a query and all the examples in the data, selecting the specified number (K) of examples closest to the query, and then voting for the most frequent label (classification) or averaging the labels (regression). For binary classification, an odd value of K is often chosen to avoid tied votes.
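The distance-and-vote procedure described above can be sketched in a few lines. This is a minimal illustration, not the WEKA implementation used in the study; the function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest examples and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (illustrative values only, not taken from the study's dataset).
points = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["absent", "absent", "present", "present", "present"]
prediction = knn_predict(points, labels, query=(3.9, 4.1), k=3)
```

With an odd `k` and two classes, the vote cannot end in a tie, which is why odd values are a common choice.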

Naïve Bayes Algorithm
A Naïve Bayes classifier is an algorithm that uses Bayes' theorem to classify objects. It makes a strong (naïve) assumption of independence between the attributes of the data points. Popular uses of Naïve Bayes classifiers include spam filters, text analysis, and medical diagnosis. This classifier is widely used in machine learning because it is simple to implement. Bayesian classification is a supervised learning method as well as a statistical classification method. It can also solve diagnostic and predictive problems. The conditional probability of each feature value in the test data is obtained by counting the training instances with that feature value in a particular class and dividing by the number of training instances in the same class. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

Here P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes). P(c) is the prior probability of the class. P(x|c) is the likelihood, which is the probability of the predictor given the class. P(x) is the prior probability of the predictor.
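The counting procedure and the posterior computation described above can be illustrated as follows. This is a minimal sketch for nominal attributes without smoothing; the function names `fit_counts` and `posterior` and the toy records are hypothetical and are not taken from the study or from WEKA.

```python
from collections import Counter, defaultdict

def fit_counts(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(row, class_counts, feature_counts):
    """Return P(c | x) for every class c under the independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, value in enumerate(row):
            # P(x_i | c): count of the value within class c divided by class size.
            score *= feature_counts[(i, c)][value] / n_c
        scores[c] = score
    evidence = sum(scores.values())  # P(x), summed over the classes
    return {c: s / evidence for c, s in scores.items()} if evidence else scores

# Toy records: (blood pressure, chest pain). Values are illustrative only.
rows = [("high", "yes"), ("high", "no"), ("normal", "no"),
        ("normal", "no"), ("high", "yes"), ("high", "yes")]
labels = ["present", "present", "absent", "absent", "present", "absent"]
class_counts, feature_counts = fit_counts(rows, labels)
result = posterior(("high", "yes"), class_counts, feature_counts)
```

A practical implementation would usually add smoothing so that a feature value never seen in a class does not force the posterior to zero.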

Support Vector Machine (SVM) Algorithm
A support vector machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression. It can solve linear as well as non-linear problems and works effectively for many practical problems. The idea of SVM is simple: the algorithm creates a line or a hyperplane that separates the data into two classes. It is one of the most popular and widely used machine learning techniques. It is often described as a binary approach because it is used for binary classification, such as present or absent, normal or abnormal, impactful or not impactful. In this study, it is used to predict heart disease versus non-heart disease. SVM uses the kernel trick to perform non-linear classification. A kernel implicitly maps the data from a low-dimensional space into a higher-dimensional space. Common kernel types include linear, polynomial, and radial basis function (RBF). In our method, we use a polynomial kernel.
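To illustrate what the kernel trick means for a polynomial kernel, the sketch below compares the kernel value with an explicit dot product in the higher-dimensional space. The degree of 2, the constant term `coef0=1.0`, and the function names are assumptions chosen for the example; the study does not report these settings.

```python
def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel K(x, z) = (x . z + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

def explicit_features(x):
    """Explicit feature map for 2-D input that matches degree=2, coef0=1."""
    x1, x2 = x
    r2 = 2 ** 0.5
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, 0.5)
kernel_value = polynomial_kernel(x, z)
# The same quantity computed as a dot product in the six-dimensional space.
explicit_value = sum(a * b for a, b in zip(explicit_features(x), explicit_features(z)))
```

Both values agree, but the kernel computes the result without ever constructing the higher-dimensional vectors.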

J48 Algorithm
J48 decision tree classification is the process of building a model of classes from a set of records that contain class labels. J48 is an extension of ID3. Additional features of J48 include accounting for missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, and so on. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm. J48 examines the normalized information gain that results from splitting the data on a chosen attribute. The attribute with the highest normalized information gain is used to make the decision. The splitting procedure stops when all instances in a subset belong to the same class. J48 then creates a decision node using the expected value of the class. J48 can handle specific attributes, missing attribute values in the data, and differing attribute costs.
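The normalized information gain mentioned above is commonly computed as the gain ratio, that is, information gain divided by the entropy of the split itself. The following is a minimal sketch for a single nominal attribute; the function names are hypothetical and this is not the J48 source code.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by the split entropy."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    # An attribute with a single value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
perfect = gain_ratio(["a", "a", "b", "b"], labels)  # separates the classes
useless = gain_ratio(["a", "b", "a", "b"], labels)  # carries no class information
```

The attribute with the highest ratio would be chosen for the split, and the procedure would then repeat on each resulting subset.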

Simple Logistic Algorithm
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The dependent variable in logistic regression is binary, i.e., it only contains data coded as 1 (TRUE, success, etc.) or 0 (FALSE, failure, etc.). Logistic regression aims to find the best-fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula for predicting a logit transformation of the probability of the presence of the characteristic of interest:

$$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where p is the probability of the presence of the characteristic of interest, X_1, ..., X_k are the independent variables, and b_0, ..., b_k are the coefficients. The logit transformation is defined as the logged odds:

$$\operatorname{odds} = \frac{p}{1 - p}, \qquad \operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$
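The relationship between the logged odds and the predicted probability can be sketched as follows. The coefficient values below are made up for illustration and are not fitted to the study's data; the function names are likewise hypothetical.

```python
import math

def logit(p):
    """Logged odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_probability(coefficients, intercept, x):
    """Invert the logit: turn a linear combination of predictors into a probability."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (not fitted) coefficients for two predictors.
coefficients, intercept = [0.8, -0.5], 0.2
p = predict_probability(coefficients, intercept, [2.0, 1.0])
# Applying the logit to p recovers the linear predictor 0.2 + 1.6 - 0.5 = 1.3.
recovered = logit(p)
```

A probability above 0.5 corresponds to positive logged odds, so thresholding the probability at 0.5 is the same as checking the sign of the linear predictor.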

One Rule Algorithm
One-R, which stands for "One Rule", is a simple yet accurate classification algorithm that generates one rule for each predictor in the data and then selects the rule with the smallest total error as its "one rule". To develop a rule for a predictor, we construct a frequency table of that predictor against the target. It has been shown that One-R produces rules that are only slightly less accurate than those of state-of-the-art classification algorithms while remaining simple for humans to interpret. Implementations of the One Rule (OneR) classification algorithm often add enhancements for numeric data and missing values, together with extensive diagnostic functions. It is useful as a baseline for machine learning models, and the rules are often helpful in their own right.
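The frequency-table procedure described above can be sketched for nominal predictors as follows. The function name `one_r` and the toy records are hypothetical; the sketch omits the discretization of numeric attributes and the handling of missing values.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best feature index, rule, errors); rule maps a value to a class."""
    best = None
    for i in range(len(rows[0])):
        # Frequency table of this predictor against the target.
        table = defaultdict(Counter)
        for row, label in zip(rows, labels):
            table[row[i]][label] += 1
        # For each value, predict its most frequent class.
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        # Total error: instances that do not belong to the predicted class.
        errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
        if best is None or errors < best[2]:
            best = (i, rule, errors)
    return best

# Toy records: (blood pressure, chest pain). Values are illustrative only.
rows = [("high", "yes"), ("high", "no"), ("normal", "no"),
        ("normal", "yes"), ("high", "yes"), ("normal", "no")]
labels = ["present", "present", "absent", "absent", "present", "absent"]
feature_index, rule, errors = one_r(rows, labels)
```

Here the first predictor separates the toy classes without error, so it is selected as the single rule.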

Zero Rule Algorithm
Zero Rule (ZeroR or 0-R) is the simplest classification method: it relies only on the target and ignores all other attributes. It is a better baseline than a random algorithm because it uses more information about the given problem to create its one rule for making predictions. This rule differs depending on the problem type. For a classification problem, where a categorical value is predicted, ZeroR predicts the class value with the most observations in the training dataset (the majority class). For example, if 65% of the data items belong to the majority class, ZeroR assigns that class to every item and is right 65% of the time. For a regression problem, where a numeric value is predicted, ZeroR predicts the mean of the training dataset. Although Zero Rule has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods. Having such a baseline is important in machine learning problems because it gives a point of reference against which all other models can be compared.
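A minimal sketch of the Zero Rule baseline for both problem types is given below. The function names and the toy label distribution are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def zero_r_classifier(train_labels):
    """Predict the most frequent class in the training data for every instance."""
    return Counter(train_labels).most_common(1)[0][0]

def zero_r_regressor(train_targets):
    """Predict the mean of the training targets for every instance."""
    return mean(train_targets)

# Toy labels in which 65% of the items belong to one class.
train_labels = ["present"] * 65 + ["absent"] * 35
majority = zero_r_classifier(train_labels)
# Baseline accuracy: the fraction of items that belong to the majority class.
baseline_accuracy = sum(label == majority for label in train_labels) / len(train_labels)
```

Any classifier that cannot beat this accuracy has learned nothing useful from the predictors.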

Vote Meta Classifier Algorithm
A meta-level classifier is a combination of two or more base-level classifiers. A meta-level classifier generally performs better than a single base-level classifier because a single classifier is more prone to over-fitting (Figure 1).
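Two common ways of combining base-level predictions are majority voting and averaging the predicted class probabilities. The sketch below shows both; which combination rule was used in the study is not stated, so the choice here is an assumption, and the function names and toy probabilities are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(distributions):
    """Average the class probabilities of the base classifiers and pick the top class."""
    classes = distributions[0].keys()
    n = len(distributions)
    averaged = {c: sum(d[c] for d in distributions) / n for c in classes}
    return max(averaged, key=averaged.get), averaged

# Toy outputs of three base classifiers for one patient (illustrative values).
distributions = [
    {"present": 0.90, "absent": 0.10},
    {"present": 0.40, "absent": 0.60},
    {"present": 0.45, "absent": 0.55},
]
hard_labels = [max(d, key=d.get) for d in distributions]
voted = majority_vote(hard_labels)
averaged_label, averaged = average_probabilities(distributions)
```

The two rules can disagree: in this toy case two of the three classifiers lean towards "absent", while the averaged probabilities favor "present" because one classifier is very confident.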

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model ("classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing (Figure 2).
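The evaluation measures that are derived from the four cells of a binary confusion matrix can be sketched as follows. The counts in the example are made up for illustration and are not results from the study.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def confusion_metrics(tp, fp, tn, fn):
    """Compute common measures from true/false positive/negative counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called the true positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "true_positive_rate": recall,
        "false_positive_rate": safe_div(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
    }

# Illustrative counts only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy alone can hide an imbalance between the two error types, which is why the true positive rate and false positive rate are reported alongside it.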

Data Sources
Data on heart disease are rarely available in our country. The few heart disease datasets that are available are not suitable for machine learning classification algorithms. We therefore use the Kaggle Heart Disease UCI dataset, which contains 899 observations with 14 attributes, as shown in Table 1.

Results
Our dataset contains 899 observations with 14 attributes. We apply all nine algorithms to our dataset. We use WEKA version 3.8.3 with stratified 10-fold cross-validation to run all the algorithms (Table 2).
a. In the Random Forest classifier, we set the number of iterations to 68.
b. In the K-Nearest Neighbors algorithm, we set the number of neighbors (K) to 45.
c. In the Naïve Bayes classifier, we use supervised discretization to convert numeric attributes to nominal ones.
d. In the Support Vector Machine, we set the complexity parameter C to 6.
e. In the J48 algorithm, we set the confidence factor to 0.15.
f. In the Simple Logistic algorithm, the maximum number of boosting iterations is 500, and we set the number of boosting iterations to 50.
g. In the One Rule algorithm, we set the minimum bucket size to 2, which is used for discretizing numeric attributes (discretization is the process of converting continuous functions, models, and variables into discrete counterparts).
h. In Zero Rule, we use the default values for all options.
i. In the Meta classifier, among the base-level classifiers, we set the number of boosting iterations to five in the Simple Logistic classifier.
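The idea behind stratified 10-fold cross-validation is that every fold keeps roughly the same class proportions as the full dataset. The sketch below shows one simple way to build such folds; it is not WEKA's implementation, and the function name, the seed, and the toy class sizes are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of each class to the folds in turn.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Toy labels (illustrative class sizes, not the study's dataset).
labels = ["present"] * 50 + ["absent"] * 40
folds = stratified_folds(labels, k=10)
```

Each fold is then used once as the test set while the remaining nine folds form the training set, and the reported measures are aggregated over the ten runs.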

Comparison of Different Machine Learning Classifier Algorithm
This paper focuses on the performance of the machine learning algorithms in terms of true positive rate, false positive rate, ROC area, F-measure, recall, precision, mean absolute error, root mean square error, percentage of correctly classified instances, and percentage of incorrectly classified instances, that is, accuracy. Our primary objective is therefore to find the machine learning algorithm that most accurately classifies the heart disease dataset according to these predefined measures. After characterizing each classifier individually, we compare the nine machine learning algorithms in a single frame by graphically displaying their accuracy and ROC curves (Figures 3 and 4). In the individual analysis, we present the ROC curve of each classifier for heart disease and non-heart disease. In the multiple comparison, however, we use all nine machine learning algorithms. The ROC areas of Random Forest (0.885), Naïve Bayes (0.893), K-nearest neighbors (0.884), Simple Logistic (0.888), and Vote Meta (0.893) are nearly the same. In the above graph, we use the nine machine learning algorithms for both heart disease and non-heart disease. The ROC areas of the remaining classifiers are Support Vector Machine 0.808, J48 0.815, One Rule 0.756, and Zero Rule 0.495.
Finally, from the comparison, we see that the Meta Vote and Naïve Bayes classifiers have the same, and the best, ROC area for both heart disease and non-heart disease.
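The ROC area can be interpreted as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions and are unrelated to the values reported above.

```python
def roc_auc(scores, labels, positive):
    """ROC area as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

labels = ["present", "present", "absent", "absent"]
separated = roc_auc([0.9, 0.8, 0.3, 0.2], labels, "present")
partly_mixed = roc_auc([0.9, 0.4, 0.6, 0.2], labels, "present")
constant = roc_auc([0.5, 0.5, 0.5, 0.5], labels, "present")
```

A classifier that gives every instance the same score ranks no pair correctly and obtains an area of 0.5, which is consistent with the near-0.5 area expected from a majority-class baseline such as Zero Rule.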

Discussion
Heart disease has become a cause of growing concern for Bangladesh, with patients suffering from it topping the list of people with non-communicable diseases. Indeed, cardiovascular diseases, particularly coronary artery disease, are steadily reaching epidemic proportions. Our study was conducted with the WEKA tool. Among the different machine learning algorithms, we can conclude that the Meta Vote classifier (a combination of Random Forest, Simple Logistic, Naïve Bayes, Bayes net, J48, K-nearest neighbors, and Support Vector Machine) and Naïve Bayes give the same, and the best, accuracy of 83.6485%. The other algorithms achieve the following accuracies in correctly classifying the heart disease dataset: Support Vector Machine (81.2013%), Simple Logistic (81.2013%), Random Forest (81.4238%), J48 (80.7564%), K-nearest neighbors (82.0912%), One Rule (75.7508%), and Zero Rule (55.0612%). Our research aims to identify precisely the machine learning algorithm that provides the highest accuracy in identifying heart disease patients. Based on our research, we suggest that the Naïve Bayes and Meta Vote algorithms predict heart disease more accurately.