Data Preprocessing Cheatsheet for Learners

About This Presentation

Data preprocessing in very simple and easy steps.


Slide Content

Data Preprocessing Cheat Sheet
1. Handling Missing Values
● Identify Missing Values: df.isnull().sum()
● Drop Rows with Missing Values: df.dropna()
● Fill Missing Values with a Specific Value: df.fillna(value)
● Fill Missing Values with Mean/Median/Mode: df.fillna(df.mean())
● Interpolate Missing Values: df.interpolate()
● Forward Fill or Backward Fill: df.ffill() or df.bfill()
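A minimal runnable sketch tying these snippets together; the DataFrame, its column names, and values are illustrative assumptions, not from the original:

    import numpy as np
    import pandas as pd

    # Toy data with gaps (columns 'age' and 'income' are hypothetical)
    df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                       "income": [50000, 62000, np.nan, 58000]})

    print(df.isnull().sum())         # missing-value count per column
    df_mean = df.fillna(df.mean())   # impute with each column's mean
    df_interp = df.interpolate()     # linear interpolation along the index
    df_ffill = df.ffill()            # propagate last valid observation forward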
2. Data Transformation
● Standardization (Z-Score Normalization): (df - df.mean()) / df.std()
● Min-Max Normalization: (df - df.min()) / (df.max() - df.min())
● Log Transformation: np.log(df)
● Square Root Transformation: np.sqrt(df)
● Power Transformation (e.g., Box-Cox): scipy.stats.boxcox(df['column'])
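A quick sketch of these transforms on a toy, strictly positive column (the data is an illustrative assumption; note Box-Cox expects a 1-D positive array):

    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})  # toy positive data

    z = (df - df.mean()) / df.std()               # z-score standardization
    mm = (df - df.min()) / (df.max() - df.min())  # min-max scaling to [0, 1]
    logged = np.log(df)                           # log transform (positive values only)
    bc_values, lam = stats.boxcox(df["x"])        # Box-Cox on a 1-D positive array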
3. Feature Encoding
● One-Hot Encoding: pd.get_dummies(df)
● Label Encoding: sklearn.preprocessing.LabelEncoder()
● Binary Encoding: category_encoders.BinaryEncoder()
● Frequency Encoding: df.groupby('column').size() / len(df)
● Mean Encoding: df.groupby('category')['target'].mean()
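A small sketch of one-hot, label, and frequency encoding on a toy column (the 'color' column is a hypothetical example):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

    onehot = pd.get_dummies(df, columns=["color"])                 # one indicator column per category
    df["color_label"] = LabelEncoder().fit_transform(df["color"])  # arbitrary integer codes
    # one way to frequency-encode: map each category to its relative frequency
    df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))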
4. Handling Categorical Data
● Convert to Category Type: df['column'].astype('category')
● Ordinal Encoding: df['column'].cat.codes
● Using Pandas' Cut for Binning: pd.cut(df['column'], bins)
● Using Pandas' QCut for Quantile Binning: pd.qcut(df['column'], q)
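A short sketch of the category dtype plus cut/qcut binning, assuming toy 'grade' and 'score' columns:

    import pandas as pd

    df = pd.DataFrame({"grade": ["B", "A", "C", "B"],
                       "score": [55, 91, 32, 67]})

    df["grade"] = df["grade"].astype("category")   # memory-efficient categorical dtype
    df["grade_code"] = df["grade"].cat.codes       # ordinal integer codes
    df["score_bin"] = pd.cut(df["score"], bins=3)  # three equal-width bins
    df["score_half"] = pd.qcut(df["score"], q=2)   # two equal-frequency bins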
5. Feature Scaling
● Robust Scaler: sklearn.preprocessing.RobustScaler()
● MaxAbs Scaler: sklearn.preprocessing.MaxAbsScaler()
● Normalizer: sklearn.preprocessing.Normalizer()
6. Feature Selection
● Variance Threshold: sklearn.feature_selection.VarianceThreshold()
● SelectKBest: sklearn.feature_selection.SelectKBest()
● Recursive Feature Elimination: sklearn.feature_selection.RFE()
● SelectFromModel: sklearn.feature_selection.SelectFromModel()
● Correlation Matrix with Heatmap: sns.heatmap(df.corr(), annot=True)
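A minimal SelectKBest example on the built-in iris dataset (the dataset and score function are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)
    X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)  # keep 2 best features
    print(X_best.shape)  # (150, 2)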
7. Handling Outliers
● IQR Method: Q1 = df.quantile(0.25); Q3 = df.quantile(0.75); IQR = Q3 - Q1; df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
● Z-Score Method: (abs(df - df.mean()) / df.std()) < 3
● Winsorizing: scipy.stats.mstats.winsorize()
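A sketch of the IQR method on a single toy column (the 'value' column and its outlier are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 300]})  # 300 is an obvious outlier

    Q1, Q3 = df["value"].quantile(0.25), df["value"].quantile(0.75)
    IQR = Q3 - Q1
    inside = (df["value"] >= Q1 - 1.5 * IQR) & (df["value"] <= Q3 + 1.5 * IQR)
    df_clean = df[inside]  # keep only rows within the 1.5*IQR fences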
8. Text Preprocessing (NLP)
● Tokenization: nltk.word_tokenize(text)
● Removing Stop Words: nltk.corpus.stopwords.words('english')
● Stemming: nltk.stem.PorterStemmer()
● Lemmatization: nltk.stem.WordNetLemmatizer()
● TF-IDF Vectorization: sklearn.feature_extraction.text.TfidfVectorizer()
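A minimal TF-IDF sketch on a toy corpus (the documents are illustrative; assumes scikit-learn >= 1.0 for get_feature_names_out):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog barked", "the cat barked"]  # toy corpus
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)             # sparse document-term matrix
    print(vec.get_feature_names_out())      # vocabulary after stop-word removal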
9. Time Series Data
● DateTime Conversion: pd.to_datetime(df['column'])
● Set DateTime as Index: df.set_index('datetime_column')
● Resampling for Time Series Aggregation: df.resample('D').mean()
● Time Series Decomposition: statsmodels.tsa.seasonal.seasonal_decompose(df['column'])
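A short resampling sketch on synthetic daily data (the 'day' and 'load' columns are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"day": pd.date_range("2024-01-01", periods=60, freq="D"),
                       "load": np.arange(60.0)})
    ts = df.set_index("day")          # a DateTime index enables resampling
    weekly = ts.resample("W").mean()  # aggregate daily values to weekly means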
10. Data Splitting
● Train-Test Split: sklearn.model_selection.train_test_split()
● K-Fold Cross-Validation: sklearn.model_selection.KFold()
● Stratified Sampling: sklearn.model_selection.StratifiedKFold()
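A minimal stratified train-test split on the iris dataset (dataset and parameters are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)  # stratify keeps class ratios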
11. Data Cleaning
● Trimming Whitespace: df['column'].str.strip()
● Replacing Values: df.replace(old_value, new_value)
● Dropping Columns: df.drop(columns=['column_to_drop'])
● Renaming Columns: df.rename(columns={'old_name': 'new_name'})
● Converting Data Types: df.astype({'column': 'new_type'})
12. Image Data Preprocessing
● Resizing Images: cv2.resize()
● Normalizing Pixel Values: image / 255.0
● Image Augmentation: ImageDataGenerator() in keras.preprocessing.image
● Grayscale Conversion: cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
13. Dimensionality Reduction
● Principal Component Analysis (PCA): sklearn.decomposition.PCA()
● t-SNE: sklearn.manifold.TSNE()
● LDA: sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
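A minimal PCA example, again on iris (an illustrative dataset choice):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    X_2d = PCA(n_components=2).fit_transform(X)  # project 4 features down to 2
    print(X_2d.shape)  # (150, 2)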
14. Dealing with Imbalanced Data
● Random Over-Sampling: imblearn.over_sampling.RandomOverSampler()
● Random Under-Sampling: imblearn.under_sampling.RandomUnderSampler()
● SMOTE: imblearn.over_sampling.SMOTE()
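A SMOTE sketch on a synthetic imbalanced dataset (requires the imbalanced-learn package; the data is generated, not real):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # 90/10 imbalanced toy classification problem
    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))  # minority class synthetically boosted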
15. Combining Features
● Polynomial Features: sklearn.preprocessing.PolynomialFeatures()
● Concatenating Features: np.concatenate([feature1, feature2], axis=1)
16. Handling Multivariate Data
● Granger Causality Test: statsmodels.tsa.stattools.grangercausalitytests()
● Vector AutoRegression (VAR): statsmodels.tsa.api.VAR()
17. Signal Processing
● Fourier Transform: np.fft.fft()
● Wavelet Transform: pywt.dwt(signal, 'db1')
18. Error Metrics
● Mean Squared Error (MSE): sklearn.metrics.mean_squared_error()
● Mean Absolute Error (MAE): sklearn.metrics.mean_absolute_error()
● R-Squared: sklearn.metrics.r2_score()
19. Data Wrangling
● Pivot Tables: df.pivot_table()
● Stacking and Unstacking: df.stack(), df.unstack()
● Melting Data: pd.melt(df)
20. Advanced DataFrame Operations
● Apply Functions: df.apply(lambda x: ...)
● GroupBy Operations: df.groupby('column').aggregate(function)
● Merge and Join DataFrames: pd.merge(df1, df2, on='key'), df1.join(df2, on='key')
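A small merge-then-groupby sketch on toy tables (the 'sales'/'customers' frames and their columns are hypothetical):

    import pandas as pd

    sales = pd.DataFrame({"key": [1, 2, 1], "amount": [10, 20, 30]})
    customers = pd.DataFrame({"key": [1, 2], "name": ["alice", "bob"]})

    merged = pd.merge(sales, customers, on="key")                    # SQL-style inner join
    summary = merged.groupby("name")["amount"].agg(["mean", "sum"])  # per-group aggregates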
21. Sequence Data Processing
● Padding Sequences: keras.preprocessing.sequence.pad_sequences()
● One-Hot Encoding for Sequences: keras.utils.to_categorical()
22. Data Verification
● Assert Statements: pd.testing.assert_frame_equal()
● Data Consistency Check: pd.testing.assert_series_equal()
23. Data Aggregation
● Cumulative Sum: df.cumsum()
● Cumulative Product: df.cumprod()
● Weighted Average: np.average(values, weights=weights)
24. Geospatial Data
● Coordinate Transformation: gdf.to_crs(epsg=4326) on a geopandas.GeoDataFrame
● Spatial Join: geopandas.sjoin()
● Distance Calculation: geopy.distance.distance(coord1, coord2)
25. Handling JSON Data
● Normalize JSON: pd.json_normalize(json_data)
● Read JSON: pd.read_json('file.json')
● To JSON: df.to_json()
26. Handling XML Data
● Parse XML: xml.etree.ElementTree.parse('file.xml')
● Find Elements in XML: tree.findall('path')
27. Probability Distributions
● Normal Distribution: np.random.normal()
● Uniform Distribution: np.random.uniform()
● Binomial Distribution: np.random.binomial()
28. Hypothesis Testing
● t-Test: scipy.stats.ttest_ind()
● ANOVA Test: scipy.stats.f_oneway()
● Chi-Squared Test: scipy.stats.chi2_contingency()
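A two-sample t-test sketch on synthetic samples (the means, sizes, and seed are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=100)       # sample drawn around mean 0
    b = rng.normal(0.5, 1.0, size=100)       # sample drawn around mean 0.5
    t_stat, p_value = stats.ttest_ind(a, b)  # two-sample t-test
    # a small p-value is evidence that the two means differ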
29. Database Interaction
● Read SQL Query: pd.read_sql_query('SELECT * FROM table', connection)
● Write to SQL: df.to_sql('table', connection)
30. Data Profiling
● Descriptive Statistics: df.describe()
● Correlation Analysis: df.corr()
● Unique Value Counts: df['column'].value_counts()
● Pandas Profiling for Comprehensive Reports: pandas_profiling.ProfileReport(df)
31. Advanced Handling of Missing Values
● KNN Imputation: from sklearn.impute import KNNImputer; imputer = KNNImputer(n_neighbors=5); df_imputed = imputer.fit_transform(df)
● Iterative Imputation: from sklearn.experimental import enable_iterative_imputer; from sklearn.impute import IterativeImputer; imputer = IterativeImputer(); df_imputed = imputer.fit_transform(df)
32. Feature Engineering
● Lag Features for Time Series: df['lag_feature'] = df['feature'].shift(1)
● Rolling Window Features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()
● Expanding Window Features: df['expanding_mean'] = df['feature'].expanding().mean()
● Datetime Features Extraction: df['hour'] = df['datetime'].dt.hour
● Binning Numeric Features: pd.cut(df['numeric_feature'], bins=3, labels=False)
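A lag/rolling/expanding sketch on a toy series (the 'feature' column is hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"feature": np.arange(10.0)})
    df["lag_1"] = df["feature"].shift(1)                         # previous observation
    df["rolling_mean"] = df["feature"].rolling(window=5).mean()  # trailing 5-step mean
    df["expanding_mean"] = df["feature"].expanding().mean()      # mean over all history so far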
33. Data Normalization for Text
● Removing Punctuation: df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Removing Numbers: df['text'].str.replace(r'\d+', '', regex=True)
● Converting to Lowercase: df['text'].str.lower()
● Removing Whitespaces: df['text'].str.strip()
34. Advanced Text Preprocessing
● Removing HTML Tags: df['text'].str.replace(r'<.*?>', '', regex=True)
● Removing URLs: df['text'].str.replace(r'http\S+|www\.\S+', '', regex=True)
● Using NLTK for Tokenization: df['text'].apply(nltk.word_tokenize)
● Using spaCy for Lemmatization: nlp = spacy.load('en_core_web_sm'); df['text'].apply(lambda t: ' '.join(tok.lemma_ for tok in nlp(t)))
35. Advanced Feature Scaling
● Quantile Transformer: sklearn.preprocessing.QuantileTransformer()
● Power Transformer: sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
36. Balancing Data
● Oversampling with SMOTE-NC for Categorical Features: imblearn.over_sampling.SMOTENC(categorical_features=[0, 2, 3])
● Cluster-Based Under-Sampling: imblearn.under_sampling.ClusterCentroids()
37. Feature Selection Based on Model
● L1 Regularization for Feature Selection: sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear')
● Tree-Based Feature Selection: sklearn.ensemble.ExtraTreesClassifier()
38. Data Discretization
● Discretization into Quantiles: pd.qcut(df['feature'], q=4)
● K-Means Discretization: sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
39. Dealing with Date and Time
● Time Delta Calculation: (df['date_end'] - df['date_start']).dt.days
● Extracting Day of Week: df['date'].dt.dayofweek
● Setting Frequency in Time Series: df.asfreq('D')
40. Handling Geospatial Data
● Creating Geospatial Features: geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.longitude, df.latitude))
● Calculating Distance Between Points: df['geometry'].distance(other_point)
41. Advanced NLP Techniques
● Named Entity Recognition (NER) with spaCy: nlp = spacy.load('en_core_web_sm'); [(ent.text, ent.label_) for ent in nlp(text).ents]
● Topic Modeling with Latent Dirichlet Allocation (LDA): gensim.models.LdaMulticore(corpus, num_topics=10)
42. Data Decomposition
● Singular Value Decomposition (SVD): scipy.linalg.svd(matrix)
● Non-Negative Matrix Factorization (NMF): sklearn.decomposition.NMF(n_components=2)
43. Advanced Image Preprocessing
● Edge Detection in Images (Canny): cv2.Canny(image, threshold1, threshold2)
● Image Thresholding: cv2.threshold(image, threshold, max_value, cv2.THRESH_BINARY)
44. Handling JSON and Complex Data Types
● Flattening JSON Nested Structures: pd.json_normalize(data, sep='_')
● Parsing JSON Strings in DataFrame: df['json_col'].apply(lambda x: json.loads(x))
45. Working with Time Series and Sequences
● Differencing a Time Series: df['value'].diff(periods=1)
● Creating Cumulative Features: df['cumulative_sum'] = df['value'].cumsum()
46. Data Validation
● Asserting DataFrame Equality: pd.testing.assert_frame_equal(df1, df2)
● Checking DataFrame Schema with Pandera: pandera.DataFrameSchema({'col': pandera.Column(int)}).validate(df)
47. Custom Transformations
● Applying Custom Functions: df.apply(lambda row: custom_function(row), axis=1)
● Vectorized String Operations: df['text'].str.cat(sep=' ')
48. Feature Extraction from Time Series
● Fourier Transform for Periodicity: np.fft.fft(df['time_series'])
● Autocorrelation Features: pd.plotting.autocorrelation_plot(df['time_series'])
49. Working with APIs and Remote Data
● Reading Data from a REST API: pd.read_json(api_endpoint)
● Loading Data from Cloud Services (e.g., AWS S3): pd.read_csv('s3://bucket_name/file.csv')
50. Advanced Data Aggregation
● Weighted Moving Average: df['value'].rolling(window=5).apply(lambda x: np.average(x, weights=[0.1, 0.2, 0.3, 0.2, 0.2]))
● Cumulative Maximum or Minimum: df['cumulative_max'] = df['value'].cummax()
● GroupBy with Custom Aggregation Functions: df.groupby('group').agg({'value': ['mean', 'std', custom_agg_function]})
● Pivot Table with Multiple Aggregates: df.pivot_table(index='group', values='value', aggfunc=['mean', 'sum', 'count'])
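A runnable weighted-moving-average sketch on a toy series (the 'value' column and the weight vector are illustrative; the weights must match the window length):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
    weights = np.array([0.1, 0.2, 0.3, 0.2, 0.2])  # one weight per window position

    df["wma"] = df["value"].rolling(window=5).apply(lambda x: np.average(x, weights=weights))
    df["cummax"] = df["value"].cummax()            # running maximum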
By: Waleed Mousa