File Loads the dataset. The dataset used here covers diabetes care in US hospitals from 1999 to 2008. Description: The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row concerns hospital records of patients diagnosed with diabetes, who underwent laboratory tests, received medications, and stayed up to 14
Data table Displays the loaded dataset as a table of rows and columns.
Preprocess Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Preprocess The following methods are used. Equal frequency discretization: each bin contains roughly an equal number of data points, which makes this method particularly useful for skewed data distributions. Replace with random values: missing values are replaced with randomly sampled values; random sampling can be done with or without replacement. Fast correlation based filter: this function selects variables from a feature table of discrete/categorical variables and a target class.
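As a rough illustration of equal frequency discretization, the sketch below assigns bin indices by rank in plain Python. It is not the widget's implementation; the function name equal_frequency_bins and the toy values are made up for this example.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank decides the bin, so the bin sizes differ by at most one.
        # Note: this simple rank-based sketch may split tied values across bins.
        bins[i] = rank * n_bins // len(values)
    return bins

# Skewed toy data (hypothetical values, not taken from the diabetes dataset).
stays = [1, 1, 2, 2, 3, 3, 4, 5, 9, 14, 30, 60]
bins = equal_frequency_bins(stays, n_bins=3)
counts = [bins.count(b) for b in range(3)]
```

Even though the values are heavily skewed towards small numbers, every bin receives the same number of points, which is the property described above.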
Data info Shows information about the dataset, such as the number of rows and columns.
Feature statistics For each attribute, shows a visualization of its distribution together with the mean, median, standard deviation, minimum and maximum.
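The same kind of per-attribute summary can be sketched with Python's standard statistics module. The values below are invented for illustration, and the population standard deviation is used as an assumption; a tool may report the sample version instead.

```python
import statistics

# Hypothetical numeric attribute.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "std": statistics.pstdev(values),  # population standard deviation
    "min": min(values),
    "max": max(values),
}
```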
Mosaic display A graphical method for visualizing the counts in n-way contingency tables, that is, tables where each cell corresponds to a distinct value combination of n attributes.
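To make the idea of a contingency table concrete, the following sketch counts value combinations for two categorical attributes. The records are hypothetical, and the code only computes the counts and shares behind the tiles; it does not draw anything.

```python
from collections import Counter

# Hypothetical records with two categorical attributes (gender, outcome).
records = [("F", "yes"), ("F", "no"), ("M", "no"),
           ("F", "yes"), ("M", "no"), ("M", "yes")]

# Each key is one value combination, i.e. one cell of the 2-way table.
cells = Counter(records)

total = len(records)
# In a mosaic display, the area of a tile is proportional to its cell's share.
shares = {cell: count / total for cell, count in cells.items()}
```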
Boxplot A method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles. In addition to the box, a box plot can have lines (called whiskers) extending from the box that indicate variability outside the upper and lower quartiles.
Scatterplot Lets you visualize the relationship between two variables. Its name comes from the graph's design: it looks like a collection of dots scattered across an x- and y-axis.
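The numbers behind a box plot can be sketched as follows. The 1.5 x IQR whisker rule used here is a common convention and an assumption of this example, not something stated above; the data values are invented.

```python
import statistics

# Hypothetical numeric attribute with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# The three quartiles define the box and the line inside it.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: whiskers reach the most extreme points that lie
# within 1.5 * IQR of the box; anything beyond is drawn as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < low_fence or x > high_fence]
```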
Distribution displays the value distribution of discrete or continuous attributes. If the data contains a class variable, distributions may be conditioned on the class.
Data sampler Implements several methods of sampling data from the input channel. It outputs the sampled dataset and the complementary dataset (the instances from the input set that are not included in the sample). The output is produced once an input dataset is provided and Sample Data is pressed.
Test and score Tests the chosen learning algorithms on the dataset. Use this widget to measure the performance of the selected models, which gives a rough idea of the quality of the dataset and of which model to use. This step is worthwhile because it saves a lot of time in the long run.
Confusion matrix Gives the number or proportion of examples from one class classified into another (or the same) class.
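A fixed-proportion sample and its complement could be produced roughly like this. The helper name sample_split, the 70% proportion and the seed are assumptions chosen for the example.

```python
import random

def sample_split(rows, fraction, seed=0):
    """Return (sample, remaining): a random sample and its complement."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    indices = list(range(len(rows)))
    # Sampling without replacement: each row is picked at most once.
    chosen = set(rng.sample(indices, round(fraction * len(rows))))
    sample = [rows[i] for i in indices if i in chosen]
    remaining = [rows[i] for i in indices if i not in chosen]
    return sample, remaining

rows = list(range(10))  # stand-in for ten data instances
sample, remaining = sample_split(rows, fraction=0.7)
```

Every input row ends up in exactly one of the two outputs, which mirrors the sampled and complementary datasets described above.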
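One common way to test a learner is k-fold cross-validation, sketched below with a trivial majority-class baseline. Both the choice of cross-validation and the helper names k_fold_indices and majority_learner are assumptions for illustration; the labels are invented.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds of roughly equal size."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def majority_learner(train_labels):
    """Baseline model: always predict the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda _row: most_common

# Hypothetical class labels for ten instances.
labels = ["no", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"]

scores = []
for train, test in k_fold_indices(len(labels), k=5):
    model = majority_learner([labels[j] for j in train])
    correct = sum(model(None) == labels[j] for j in test)
    scores.append(correct / len(test))

# Average accuracy over the folds, the kind of score used to compare models.
mean_accuracy = sum(scores) / len(scores)
```

A real learner should score clearly above such a baseline; if it does not, that says something about either the model or the quality of the data.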
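The counts and proportions in a confusion matrix can be computed with a few lines of plain Python. The actual and predicted labels below are invented for the example.

```python
from collections import Counter

# Hypothetical true classes and model predictions.
actual    = ["yes", "no", "no", "yes", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "no", "yes"]

# matrix[(a, p)] = number of examples of class a classified as class p.
matrix = Counter(zip(actual, predicted))

# Proportions within each actual class (each row of the matrix sums to 1).
row_totals = Counter(actual)
proportions = {(a, p): n / row_totals[a] for (a, p), n in matrix.items()}
```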
KNN The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning method that uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the simplest and most popular classification and regression methods used in machine learning today. While KNN can be used for either regression or classification problems, it is typically used for classification, working off the assumption that similar points can be found near one another.
Logistic regression Used for binary classification. It applies the sigmoid function to a weighted combination of the independent variables and produces a probability value between 0 and 1.
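The proximity idea can be sketched as a minimal KNN classifier. Euclidean distance, k=3 and the toy points are assumptions of this example, not settings taken from the workflow.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (features, class label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (5.1, 5.0))
```

There is no training step beyond storing the data, which is why the method is called non-parametric.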
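A minimal sketch of the prediction step: the sigmoid squashes a weighted sum of the inputs into the range 0 to 1. The weights and bias below are made-up numbers, not fitted values, and the function names are chosen for this example.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one input vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical (not fitted) parameters and one input with two features.
p = predict_proba(weights=[0.8, -0.5], bias=0.1, x=[2.0, 1.0])

# A common decision rule (an assumption here): positive if p >= 0.5.
label = int(p >= 0.5)
```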
Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.
Naïve Bayes The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification. It uses principles of probability to perform classification. Naïve Bayes is part of a family of generative learning algorithms, meaning that it seeks to model the distribution of inputs of a given class or category.
k-means Groups the data into a chosen number of clusters. The n_clusters parameter sets the number of clusters to form as well as the number of centroids to generate. For an example of how to choose an optimal value for n_clusters, refer to "Selecting the number of clusters with silhouette analysis on KMeans clustering".
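The generative idea can be sketched for categorical features: estimate how often each value occurs within each class, then pick the class under which the new input is most probable. The add-one (Laplace) smoothing, the function names train_nb and predict_nb, and the toy rows are assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count classes and per-class feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    # Number of distinct values per feature, needed for smoothing.
    n_values = [len({row[i] for row in rows}) for i in range(len(rows[0]))]
    return class_counts, value_counts, n_values

def predict_nb(model, row):
    """Return the class with the highest log prior plus log likelihoods."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            seen = value_counts[(i, label)][value] + 1
            score += math.log(seen / (count + n_values[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data with two categorical features.
rows = [("high", "yes"), ("high", "no"), ("low", "no"),
        ("low", "no"), ("high", "yes"), ("low", "yes")]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
model = train_nb(rows, labels)
```

The "naïve" part is the assumption that features are independent given the class, which is why the per-feature log probabilities are simply added.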
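To show what the number of clusters controls, here is a bare-bones k-means loop in plain Python. It is a sketch only: taking the first k points as initial centroids and running a fixed number of iterations are simplifying assumptions, and the points are invented.

```python
import math

def kmeans(points, k, n_iter=10):
    """Return (centroids, labels) after n_iter assignment/update rounds."""
    centroids = list(points[:k])  # naive initialization (assumption)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[c]  # keep the centroid of an empty cluster
            for c, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    return centroids, labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8),
          (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
```

One centroid is produced per cluster, so k fixes both numbers at once, as described above.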