Batch_-(13)DMT.pptx

Uploaded by sachinkumar934923 · 11 slides · Sep 15, 2025

Slide Content

DATA MINING TECHNIQUES

Submitted By: 231FA04549, 231FA04C72, 231FA04D55, 231FA04D70
Submitted To: Mrs. Shareefa Syed

QUESTION

The following dataset represents patient data collected from a city hospital.

Patient ID | Heart Rate | Glucose Level | Admission Delay (days) | Condition Status
-----------|------------|---------------|------------------------|-----------------
P001       | 85         | 160           | 2                      | Stable
P002       | 95         |               |                        | Critical
P003       | 78         | -30           | 5                      | Stable
P004       | 105        | 210           | 999                    | Emergency
P005       | -40        | 185           | 3                      | Stable

1. Identify any three preprocessing techniques to handle data quality issues in the hospital dataset.
2. Why is it important to preprocess the HeartRate attribute before predicting the risk category?
3. The GlucoseLevel attribute contains missing and erroneous values. Explain how you would handle these using imputation.
4. Compare standardization vs. normalization for HeartRate and GlucoseLevel.
5. The ConditionStatus attribute has missing entries. Should these be removed or imputed? Justify.
6. Write pseudocode to detect and replace outliers in HeartRate and GlucoseLevel using the IQR method.

EXPLANATION

Step 1: Handling Missing Values
- Detect missing entries (NaN or blanks).
- Fill using the mean, median, or mode, or advanced methods such as KNN imputation.
- If a column has too many missing values, drop it.

Step 2: Outlier Detection & Treatment
- Use the IQR (interquartile range) or Z-score to detect extreme values.
- Treat by capping, replacing with the median, or removing values that are clearly wrong.

Step 3: Encoding Categorical Variables
- Convert text attributes (e.g., ConditionStatus) into numeric form.
- Methods: label encoding / one-hot encoding.

A) Preprocessing Techniques

# Handle missing numeric values (median is robust to outliers)
df["GlucoseLevel"] = df["GlucoseLevel"].fillna(df["GlucoseLevel"].median())
df["AdmissionDelay"] = df["AdmissionDelay"].fillna(df["AdmissionDelay"].mean())
# Fill missing categorical values with the most frequent category
df["ConditionStatus"] = df["ConditionStatus"].fillna(df["ConditionStatus"].mode()[0])
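The two encoding methods named in Step 3 can be sketched in pandas as follows. This is a minimal illustration, not the author's code: the mini-frame and its values are assumptions standing in for the ConditionStatus column.

```python
import pandas as pd

# Hypothetical mini-frame mirroring the ConditionStatus column
df = pd.DataFrame({"ConditionStatus": ["Stable", "Critical", "Stable", "Emergency"]})

# Label Encoding: map each category to an integer code (alphabetical order)
df["Condition_label"] = df["ConditionStatus"].astype("category").cat.codes

# One-Hot Encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["ConditionStatus"], prefix="Condition")
df = pd.concat([df, one_hot], axis=1)
```

Label encoding imposes an artificial order on the codes, so one-hot encoding is usually safer for nominal attributes like ConditionStatus.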

B) Importance of Preprocessing HeartRate

Step 1: Scale consistency. HeartRate's range may differ greatly from the other features; without scaling, distance-based models (KNN, SVM) will be biased toward large-valued features.
Step 2: Outlier impact. Very high (300) or very low (-40) HeartRate values distort model predictions.
Step 3: Missing/erroneous values. Missing or incorrect HeartRate readings mislead risk-prediction models.

# Mark negative (physiologically impossible) values as missing
df["HeartRate"] = df["HeartRate"].apply(lambda x: np.nan if x < 0 else x)
# Impute with the median
df["HeartRate"] = df["HeartRate"].fillna(df["HeartRate"].median())

C) Handling Missing & Erroneous GlucoseLevel

Step 1: Detect missing values (NaN).
Step 2: Detect erroneous values (such as -30) and mark them as missing.
Step 3: Replace missing/erroneous values with the median (robust to outliers).
Step 4: Optionally use KNN or regression-based imputation for better accuracy.

# Mark invalid (negative) readings as NaN
df["GlucoseLevel"] = df["GlucoseLevel"].apply(lambda x: np.nan if x < 0 else x)
# Impute with the median
df["GlucoseLevel"] = df["GlucoseLevel"].fillna(df["GlucoseLevel"].median())
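The optional KNN imputation mentioned in Step 4 might look like this with scikit-learn's KNNImputer. This is only a sketch: the two-column mini-frame and its sample values are assumptions, and n_neighbors=2 is an arbitrary choice for so few rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical mini-frame with one missing and one erroneous glucose reading
df = pd.DataFrame({
    "HeartRate":    [85, 95, 78, 105, 90],
    "GlucoseLevel": [160, np.nan, -30, 210, 185],
})

# First mark erroneous (negative) readings as missing
df.loc[df["GlucoseLevel"] < 0, "GlucoseLevel"] = np.nan

# KNN imputation: each missing value is filled with the mean of the
# k nearest rows, measured on the features that are present
imputer = KNNImputer(n_neighbors=2)
df[["HeartRate", "GlucoseLevel"]] = imputer.fit_transform(df[["HeartRate", "GlucoseLevel"]])
```

Unlike plain median imputation, KNN uses the other attributes (here HeartRate) to pick a fill value that matches similar patients.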

D) Standardization vs. Normalization for HeartRate and GlucoseLevel

Standardization (Z-score): z = (x - mean) / std
- Centers data around 0 with unit variance.
- Preferred for models that assume Gaussian-like data or use dot products (SVM, logistic regression).

Normalization (Min-Max): x' = (x - min) / (max - min)
- Scales values to [0, 1].
- Preferred when bounded inputs are required (some neural networks, or when features must be constrained).

Recommendation: for HeartRate and GlucoseLevel, start with STANDARDIZATION for most ML algorithms; use MIN-MAX scaling only if a specific model needs bounded input, or for UI display.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization (Z-score)
df["HeartRate_std"] = StandardScaler().fit_transform(df[["HeartRate"]])
# Normalization (Min-Max)
df["HeartRate_norm"] = MinMaxScaler().fit_transform(df[["HeartRate"]])
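As a quick numeric check of the two formulas, here is a minimal NumPy sketch on three hypothetical HeartRate readings (85, 95, 105). The population standard deviation is used, matching StandardScaler's default behavior.

```python
import numpy as np

# Three hypothetical HeartRate readings (illustration only)
hr = np.array([85.0, 95.0, 105.0])

# Standardization: z = (x - mean) / std  -> mean 0, unit variance
z = (hr - hr.mean()) / hr.std()

# Normalization: x' = (x - min) / (max - min)  -> mapped onto [0, 1]
mm = (hr - hr.min()) / (hr.max() - hr.min())

print(z)   # roughly [-1.2247, 0.0, 1.2247]
print(mm)  # [0.0, 0.5, 1.0]
```

Note how standardization is unbounded (values can fall outside [-1, 1]), while min-max always lands exactly in [0, 1].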

E) Missing ConditionStatus

Do not remove rows: that causes data loss. Instead, impute missing ConditionStatus using one of:
- the mode (most frequent value),
- a dedicated "Unknown" category, or
- a prediction from the other features (HeartRate, GlucoseLevel).

In healthcare, every record is important, so imputation is preferred over deletion.

# Mode imputation for a categorical attribute
df["ConditionStatus"] = df["ConditionStatus"].fillna(df["ConditionStatus"].mode()[0])
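The third option, predicting the missing ConditionStatus from the other features, could be sketched with a scikit-learn classifier. DecisionTreeClassifier is an arbitrary choice here, and the mini-frame is a hypothetical stand-in for the cleaned hospital data, not the author's pipeline.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned frame; the last ConditionStatus entry is missing
df = pd.DataFrame({
    "HeartRate":       [85, 95, 78, 105, 95],
    "GlucoseLevel":    [160, 185, 185, 210, 185],
    "ConditionStatus": ["Stable", "Critical", "Stable", "Emergency", None],
})

# Train on the rows where the label is known
known = df["ConditionStatus"].notna()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(df.loc[known, ["HeartRate", "GlucoseLevel"]], df.loc[known, "ConditionStatus"])

# Fill only the missing entries with the model's predictions
df.loc[~known, "ConditionStatus"] = clf.predict(df.loc[~known, ["HeartRate", "GlucoseLevel"]])
```

This keeps every record while making the imputed labels consistent with the patients' other measurements, at the cost of extra modeling work.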

F) Pseudocode: Outlier Detection & Replacement (IQR Method)

Note: mark obvious data-entry errors (e.g., negative values, unrealistic codes like 999) as missing BEFORE computing Q1/Q3.

For each attribute in [HeartRate, GlucoseLevel]:
    Step 1: Compute Q1 = 25th percentile and Q3 = 75th percentile
    Step 2: IQR = Q3 - Q1
    Step 3: LowerBound = Q1 - 1.5 * IQR
            UpperBound = Q3 + 1.5 * IQR
    Step 4: For each value v in the attribute:
                If v < LowerBound OR v > UpperBound:
                    Replace v with Median(attribute)
            End for
End for
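One possible Python rendering of the pseudocode above, using pandas. The helper name replace_outliers_iqr and the sample column (with one extreme reading of 300) are assumptions for illustration.

```python
import pandas as pd

def replace_outliers_iqr(df, column):
    """Replace IQR outliers in `column` with the column median."""
    q1 = df[column].quantile(0.25)        # Step 1: 25th percentile
    q3 = df[column].quantile(0.75)        #         75th percentile
    iqr = q3 - q1                         # Step 2: interquartile range
    lower = q1 - 1.5 * iqr                # Step 3: fence bounds
    upper = q3 + 1.5 * iqr
    median = df[column].median()
    mask = (df[column] < lower) | (df[column] > upper)
    df.loc[mask, column] = median         # Step 4: replace outliers
    return df

# Hypothetical HeartRate column with one extreme reading (300)
df = pd.DataFrame({"HeartRate": [85.0, 95.0, 78.0, 105.0, 300.0]})
df = replace_outliers_iqr(df, "HeartRate")
```

With these sample values Q1 = 85 and Q3 = 105, so the upper fence is 135 and the reading of 300 is replaced by the median (95).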

Output

PatientID | HeartRate | GlucoseLevel | AdmissionDelay | ConditionStatus
----------|-----------|--------------|----------------|----------------
F001      | 85        | 185          | 2.00           | Stable
F002      | 95        | 185          | 3.33           | Critical
F003      | 78        | 185          | 5.00           | Stable
F004      | 105       | 185          | 3.33           | Emergency
F005      | 95        | 185          | 3.00           | Stable
F006      | 100       | 185          | 3.33           | Stable

Notes:
- Erroneous values (negative heart rate/glucose; AdmissionDelay = 999) were marked as NaN and then imputed.
- HeartRate, GlucoseLevel: median imputation (HR median = 95, glucose median = 185).
- AdmissionDelay: mean imputation (mean of valid delays = 3.333..., shown rounded to 2 decimals).
- Outliers detected by IQR were replaced with the column median.
- Missing ConditionStatus filled with the mode (Stable).

THANK YOU