Data Preprocessing Cheatsheet for learners
bhibbu
Jul 10, 2024
About This Presentation
Data preprocessing in very simple and easy steps
Slide Content
Slide 1
# [Data Preprocessing] (CheatSheet)
1. Handling Missing Values
● Identify Missing Values: df.isnull().sum()
● Drop Rows with Missing Values: df.dropna()
● Fill Missing Values with a Specific Value: df.fillna(value)
● Fill Missing Values with Mean/Median/Mode: df.fillna(df.mean(numeric_only=True)), df.fillna(df.median(numeric_only=True)), or df.fillna(df.mode().iloc[0])
● Interpolate Missing Values: df.interpolate()
● Forward Fill or Backward Fill: df.ffill() or df.bfill()
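A minimal sketch of the calls above on a toy DataFrame. The column names and values are made up for illustration, and median/mode filling is only one reasonable choice among those listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Cairo", "Giza", None, "Giza"],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Fill the numeric column with its median and the text column with its mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])
```

The mean only makes sense for numeric columns, so text columns are usually filled with the mode or a fixed placeholder instead.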
2. Data Transformation
● Standardization (Z-Score Normalization): (df - df.mean()) / df.std()
● Min-Max Normalization: (df - df.min()) / (df.max() - df.min())
● Log Transformation: np.log(df)
● Square Root Transformation: np.sqrt(df)
● Power Transformation (e.g., Box-Cox): scipy.stats.boxcox(df['column']) (expects 1-D, strictly positive data; returns the transformed values and the fitted lambda)
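The two scaling formulas above can be checked on a small numeric frame. This is only a sketch; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 60.0, 70.0],
})

# Z-score standardization: each column ends up with mean 0 and sample std 1.
standardized = (df - df.mean()) / df.std()

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())
```

Note that df.std() uses the sample standard deviation (ddof=1), so the result differs slightly from sklearn.preprocessing.StandardScaler, which divides by the population standard deviation.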
3. Feature Encoding
● One-Hot Encoding: pd.get_dummies(df)
● Label Encoding: sklearn.preprocessing.LabelEncoder()
● Binary Encoding: category_encoders.BinaryEncoder()
● Frequency Encoding: df.groupby('column').size() / len(df)
● Mean Encoding: df.groupby('category')['target'].mean()
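The frequency and mean encodings above produce one value per category. A common follow-up, sketched here on made-up data, is to map those values back onto the rows with Series.map; the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: one categorical column and a binary target.
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a", "b"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Frequency encoding: share of rows belonging to each category.
freq = df["category"].value_counts(normalize=True)
df["category_freq"] = df["category"].map(freq)

# Mean (target) encoding: average target value per category.
means = df.groupby("category")["target"].mean()
df["category_mean"] = df["category"].map(means)
```

When the encoding feeds a model, the per-category means are normally computed on the training split only, otherwise the target leaks into the features.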
4. Handling Categorical Data
● Convert to Category Type: df['column'].astype('category')
● Ordinal Encoding: df['column'].astype('category').cat.codes
● Using Pandas' Cut for Binning: pd.cut(df['column'], bins)
● Using Pandas' QCut for Quantile Binning: pd.qcut(df['column'], q)
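To illustrate the difference between the two binning calls above: pd.cut uses edges you choose, while pd.qcut picks edges so that each bin holds roughly the same number of rows. The ages, edges, and labels below are assumptions.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([5, 17, 18, 35, 64, 80])

# Fixed-edge binning; right=False makes the intervals [0, 18), [18, 65), [65, 120).
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
    right=False,
)

# Quantile binning into two equally populated halves, returned as integer codes.
half = pd.qcut(ages, q=2, labels=False)

# Ordinal codes of the ordered categories produced by pd.cut.
codes = age_group.cat.codes
```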
By: Waleed Mousa
Slide 2
5. Feature Scaling
● Robust Scaler: sklearn.preprocessing.RobustScaler()
● MaxAbs Scaler: sklearn.preprocessing.MaxAbsScaler()
● Normalizer: sklearn.preprocessing.Normalizer()
6. Feature Selection
● Variance Threshold: sklearn.feature_selection.VarianceThreshold()
● SelectKBest: sklearn.feature_selection.SelectKBest()
● Recursive Feature Elimination: sklearn.feature_selection.RFE()
● SelectFromModel: sklearn.feature_selection.SelectFromModel()
● Correlation Matrix with Heatmap: sns.heatmap(df.corr(numeric_only=True), annot=True)
7. Handling Outliers
● IQR Method: Q1 = df.quantile(0.25); Q3 = df.quantile(0.75); IQR = Q3 - Q1; df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))] (outlier cells become NaN; append .dropna() to drop those rows)
● Z-Score Method: (abs(df - df.mean()) / df.std()) < 3
● Winsorizing: scipy.stats.mstats.winsorize()
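A minimal sketch of the IQR rule above applied to a single Series, where the boolean mask drops outliers directly. The values are made up.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Keep only the rows that are not flagged.
cleaned = values[~is_outlier]
```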
8. Text Preprocessing (NLP)
● Tokenization: nltk.word_tokenize(text)
● Removing Stop Words: nltk.corpus.stopwords.words('english')
● Stemming: nltk.stem.PorterStemmer()
● Lemmatization: nltk.stem.WordNetLemmatizer()
● TF-IDF Vectorization: sklearn.feature_extraction.text.TfidfVectorizer()
9. Time Series Data
● DateTime Conversion: pd.to_datetime(df['column'])
● Set DateTime as Index: df.set_index('datetime_column')
● Resampling for Time Series Aggregation: df.resample('D').mean()
● Time Series Decomposition: statsmodels.tsa.seasonal.seasonal_decompose(df['column'])
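The first three time series steps above chain together naturally. A small sketch, assuming a hypothetical "timestamp" column stored as strings:

```python
import pandas as pd

# Hypothetical readings taken twice a day.
df = pd.DataFrame({
    "timestamp": [
        "2024-01-01 00:00",
        "2024-01-01 12:00",
        "2024-01-02 06:00",
        "2024-01-02 18:00",
    ],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Parse strings into datetimes, index by time, then average per calendar day.
df["timestamp"] = pd.to_datetime(df["timestamp"])
daily = df.set_index("timestamp").resample("D").mean()
```

resample() needs a datetime-like index (or an explicit on= column), which is why the conversion and set_index steps come first.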
Slide 3
10. Data Splitting
● Train-Test Split: sklearn.model_selection.train_test_split()
● K-Fold Cross-Validation: sklearn.model_selection.KFold()
● Stratified Sampling: sklearn.model_selection.StratifiedKFold()
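A sketch of how the splitting helpers above are typically called. The synthetic arrays, the 80/20 ratio, and the random seeds are assumptions chosen only to make the class proportions easy to see.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical imbalanced data: 15 negatives and 5 positives.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the 3:1 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold CV: every test fold receives its share of the positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positives = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
```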
11. Data Cleaning
● Trimming Whitespace: df['column'].str.strip()
● Replacing Values: df.replace(old_value, new_value)
● Dropping Columns: df.drop(columns=['column_to_drop'])
● Renaming Columns: df.rename(columns={'old_name': 'new_name'})
● Converting Data Types: df.astype({'column': 'new_type'})
12. Image Data Preprocessing
● Resizing Images: cv2.resize()
● Normalizing Pixel Values: image / 255.0
● Image Augmentation: ImageDataGenerator() in keras.preprocessing.image
● Grayscale Conversion: cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
13. Dimensionality Reduction
● Principal Component Analysis (PCA): sklearn.decomposition.PCA()
● t-SNE: sklearn.manifold.TSNE()
● LDA: sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
14. Dealing with Imbalanced Data
● Random Over-Sampling: imblearn.over_sampling.RandomOverSampler()
● Random Under-Sampling: imblearn.under_sampling.RandomUnderSampler()
● SMOTE: imblearn.over_sampling.SMOTE()
15. Combining Features
● Polynomial Features: sklearn.preprocessing.PolynomialFeatures()
Slide 4
● Concatenating Features: np.concatenate([feature1, feature2], axis=1)
16. Handling Multivariate Data
● Granger Causality Test: statsmodels.tsa.stattools.grangercausalitytests()
● Vector Auto Regression (VAR): statsmodels.tsa.api.VAR()
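17. Signal Processing
● Fourier Transform: np.fft.fft()
● Wavelet Transform: pywt.dwt(data, 'db1') (pywt.Wavelet() only constructs the wavelet object; pywt.dwt() or pywt.wavedec() performs the transform)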
18. Error Metrics
● Mean Squared Error (MSE): sklearn.metrics.mean_squared_error()
● Mean Absolute Error (MAE): sklearn.metrics.mean_absolute_error()
● R-Squared: sklearn.metrics.r2_score()
19. Data Wrangling
● Pivot Tables: df.pivot_table()
● Stacking and Unstacking: df.stack(), df.unstack()
● Melting Data: pd.melt(df)
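Melting and pivoting are inverse reshapes, which a round trip on a tiny table makes visible. The column names below are assumptions.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject.
wide = pd.DataFrame({"id": [1, 2], "math": [90, 70], "art": [80, 60]})

# Wide to long: one row per (student, subject) pair.
long = pd.melt(wide, id_vars="id", var_name="subject", value_name="score")

# Long back to wide: subjects become columns again (sorted alphabetically).
back = long.pivot_table(index="id", columns="subject", values="score")
```

pivot_table aggregates with the mean by default, so duplicate (id, subject) pairs would be averaged rather than raising an error.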
20. Advanced DataFrame Operations
● Apply Functions: df.apply(lambda x: ...)
● GroupBy Operations: df.groupby('column').aggregate(function)
● Merge and Join DataFrames: pd.merge(df1, df2, on='key'), df1.join(df2, on='key')
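A short sketch combining the group-by and merge calls above: aggregate one table, then attach attributes from another. The table and column names are hypothetical, and how='left' is chosen here so that unmatched keys stay visible.

```python
import pandas as pd

# Hypothetical order and customer tables sharing a "customer_id" key.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 20.0, 5.0, 7.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Bo"]})

# Total amount per customer.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# A left merge keeps every customer from totals; missing names become NaN.
merged = pd.merge(totals, customers, on="customer_id", how="left")
```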
21. Sequence Data Processing
● Padding Sequences: keras.preprocessing.sequence.pad_sequences()
● One-Hot Encoding for Sequences: keras.utils.to_categorical()
Slide 5
22. Data Verification
● Assert Statements: pd.testing.assert_frame_equal()
● Data Consistency Check: pd.testing.assert_series_equal()
23. Data Aggregation
● Cumulative Sum: df.cumsum()
● Cumulative Product: df.cumprod()
● Weighted Average: np.average(values, weights=weights)
24. Geospatial Data
● Coordinate Transformation: gdf.to_crs(epsg=4326) (gdf is a geopandas.GeoDataFrame)
● Spatial Join: geopandas.sjoin()
● Distance Calculation: geopy.distance.distance(coord1, coord2)
25. Handling JSON Data
● Normalize JSON: pd.json_normalize(json_data)
● Read JSON: pd.read_json('file.json')
● To JSON: df.to_json()
26. Handling XML Data
● Parse XML: xml.etree.ElementTree.parse('file.xml')
● Find Elements in XML: tree.findall('path')
27. Probability Distributions
● Normal Distribution: np.random.normal()
● Uniform Distribution: np.random.uniform()
● Binomial Distribution: np.random.binomial()
28. Hypothesis Testing
● t-Test: scipy.stats.ttest_ind()
● ANOVA Test: scipy.stats.f_oneway()
Slide 6
● Chi-Squared Test: scipy.stats.chi2_contingency()
29. Database Interaction
● Read SQL Query: pd.read_sql_query('SELECT * FROM table', connection)
● Write to SQL: df.to_sql('table', connection)
30. Data Profiling
● Descriptive Statistics: df.describe()
● Correlation Analysis: df.corr(numeric_only=True)
● Unique Value Counts: df['column'].value_counts()
● Pandas Profiling for Comprehensive Reports: ydata_profiling.ProfileReport(df) (the package was formerly named pandas_profiling)
31. Advanced Handling of Missing Values
● KNN Imputation: from sklearn.impute import KNNImputer; imputer = KNNImputer(n_neighbors=5); df_imputed = imputer.fit_transform(df)
● Iterative Imputation: from sklearn.experimental import enable_iterative_imputer; from sklearn.impute import IterativeImputer; imputer = IterativeImputer(); df_imputed = imputer.fit_transform(df)
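fit_transform returns a NumPy array, so the column names are lost unless the result is wrapped again. A minimal sketch of KNN imputation on made-up data, with n_neighbors=2 chosen only because the example is tiny:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with one missing value in "y".
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# The gap is filled with the mean "y" of the 2 rows whose "x" is closest.
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
```

Here the rows with x = 1 and x = 3 are the nearest neighbours of the incomplete row, so the missing value becomes the mean of 10 and 30.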
32. Feature Engineering
● Lag Features for Time Series: df['lag_feature'] = df['feature'].shift(1)
● Rolling Window Features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()
● Expanding Window Features: df['expanding_mean'] = df['feature'].expanding().mean()
● Datetime Features Extraction: df['hour'] = df['datetime'].dt.hour
● Binning Numeric Features: pd.cut(df['numeric_feature'], bins=3, labels=False)
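The lag, rolling, and expanding features above differ mainly in which past rows they look at. A sketch on a made-up series, using window=3 instead of 5 only to keep the example short:

```python
import pandas as pd

# Hypothetical evenly spaced observations.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Lag: the previous row's value (the first row has no predecessor, so NaN).
df["lag_1"] = df["feature"].shift(1)

# Rolling: mean of the current and two previous rows (NaN until the window fills).
df["rolling_mean_3"] = df["feature"].rolling(window=3).mean()

# Expanding: mean of everything seen so far.
df["expanding_mean"] = df["feature"].expanding().mean()
```

The leading NaN rows usually need to be dropped or filled before the features are passed to a model.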
33. Data Normalization for Text
Slide 7
● Removing Punctuation: df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Removing Numbers: df['text'].str.replace(r'\d+', '', regex=True)
● Converting to Lowercase: df['text'].str.lower()
● Removing Whitespaces: df['text'].str.strip()
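The four steps above can be chained into one cleaning pass. This sketch uses raw strings for the patterns and adds one step not in the list, collapsing runs of internal whitespace, since str.strip() only trims the ends. The sample text is made up.

```python
import pandas as pd

# Hypothetical messy text.
df = pd.DataFrame({"text": ["  Hello, World! 123 ", "Data   Prep: 101!!"]})

cleaned = (
    df["text"]
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
    .str.replace(r"\d+", "", regex=True)      # drop digits
    .str.lower()                              # lowercase
    .str.replace(r"\s+", " ", regex=True)     # assumption: collapse inner whitespace
    .str.strip()                              # trim leading/trailing whitespace
)
```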
34. Advanced Text Preprocessing
● Removing HTML Tags: df['text'].str.replace(r'<.*?>', '', regex=True)
● Removing URLs: df['text'].str.replace(r'http\S+|www\.\S+', '', regex=True)
● Using NLTK for Tokenization: df['text'].apply(nltk.word_tokenize)
● Using Spacy for Lemmatization: nlp = spacy.load('en_core_web_sm'); df['text'].apply(lambda t: [tok.lemma_ for tok in nlp(t)])
35. Advanced Feature Scaling
● Quantile Transformer: sklearn.preprocessing.QuantileTransformer()
● Power Transformer: sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
36. Balancing Data
● Oversampling with SMOTE-NC for Categorical Features: imblearn.over_sampling.SMOTENC(categorical_features=[0, 2, 3])
● Cluster-Based Oversampling: imblearn.over_sampling.KMeansSMOTE() (ClusterCentroids lives in imblearn.under_sampling and under-samples instead)
37. Feature Selection Based on Model
● L1 Regularization for Feature Selection: sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear')
● Tree-Based Feature Selection: sklearn.ensemble.ExtraTreesClassifier()
38. Data Discretization
● Discretization into Quantiles: pd.qcut(df['feature'], q=4)
Slide 8
● K-Means Discretization: sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
39. Dealing with Date and Time
● Time Delta Calculation: (df['date_end'] - df['date_start']).dt.days
● Extracting Day of Week: df['date'].dt.dayofweek
● Setting Frequency in Time Series: df.asfreq('D')
40. Handling Geospatial Data
● Creating Geospatial Features: geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.longitude, df.latitude))
● Calculating Distance Between Points: df['geometry'].distance(other_point)
41. Advanced NLP Techniques
● Named Entity Recognition (NER) with Spacy: nlp = spacy.load('en_core_web_sm'); [(ent.text, ent.label_) for ent in nlp(text).ents]
● Topic Modeling with Latent Dirichlet Allocation (LDA): gensim.models.LdaMulticore(corpus, num_topics=10)
42. Data Decomposition
● Singular Value Decomposition (SVD): scipy.linalg.svd(matrix)
● Non-Negative Matrix Factorization (NMF): sklearn.decomposition.NMF(n_components=2)
43. Advanced Image Preprocessing
● Edge Detection in Images (Canny): cv2.Canny(image, threshold1, threshold2)
● Image Thresholding: cv2.threshold(image, threshold, max_value, cv2.THRESH_BINARY)
44. Handling JSON and Complex Data Types
Slide 9
● Flattening JSON Nested Structures: pd.json_normalize(data, sep='_')
● Parsing JSON Strings in DataFrame: df['json_col'].apply(lambda x: json.loads(x))
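A brief sketch of both JSON calls above. The record layout and the "json_col" contents are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical nested records, e.g. as returned by an API.
records = [
    {"id": 1, "user": {"name": "Ana", "address": {"city": "Cairo"}}},
    {"id": 2, "user": {"name": "Bo", "address": {"city": "Giza"}}},
]

# Nested keys are joined with "_" to form flat column names.
flat = pd.json_normalize(records, sep="_")

# A column of JSON strings parsed into Python dicts, one per row.
df = pd.DataFrame({"json_col": ['{"a": 1}', '{"a": 2}']})
parsed = df["json_col"].apply(json.loads)
```

The parsed dicts can themselves be flattened with pd.json_normalize(parsed.tolist()) when they share a structure.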
45. Working with Time Series and Sequences
● Differencing a Time Series: df['value'].diff(periods=1)
● Creating Cumulative Features: df['cumulative_sum'] = df['value'].cumsum()
46. Data Validation
● Asserting Dataframe Equality: pd.testing.assert_frame_equal(df1, df2)
● Checking DataFrame Schema with Pandera: schema.validate(df) (schema is a pandera.DataFrameSchema)
47. Custom Transformations
● Applying Custom Functions: df.apply(lambda row: custom_function(row), axis=1)
● Vectorized String Operations: df['text'].str.cat(sep=' ')
48. Feature Extraction from Time Series
● Fourier Transform for Periodicity: np.fft.fft(df['time_series'])
● Autocorrelation Features: pd.plotting.autocorrelation_plot(df['time_series']) (draws a plot; df['time_series'].autocorr(lag=1) returns the value at one lag)
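To turn the Fourier transform above into a periodicity feature, one option is to read off the frequency with the largest magnitude. This sketch assumes a synthetic 5 Hz sine sampled at 100 Hz and uses np.fft.rfft, the real-input variant of np.fft.fft, so that only non-negative frequencies are returned.

```python
import numpy as np

# Assumed sampling setup: 200 samples at 100 Hz (2 seconds of signal).
sample_rate = 100
t = np.arange(200) / sample_rate

# Synthetic signal with a single 5 Hz component.
signal = np.sin(2 * np.pi * 5 * t)

# Magnitude spectrum and the frequency (in Hz) of each bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Skip bin 0 (the constant offset) and take the strongest remaining bin.
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]
```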
49. Working with APIs and Remote Data
● Reading Data from a REST API: pd.read_json(api_endpoint)
● Loading Data from Cloud Services (e.g., AWS S3): pd.read_csv('s3://bucket_name/file.csv')
50. Advanced Data Aggregation
Slide 10
● Weighted Moving Average: df['value'].rolling(window=5).apply(lambda x: np.average(x, weights=[0.1, 0.2, 0.3, 0.2, 0.2]))
● Cumulative Maximum or Minimum: df['cumulative_max'] = df['value'].cummax()
● GroupBy with Custom Aggregation Functions: df.groupby('group').agg({'value': ['mean', 'std', custom_agg_function]})
● Pivot Table with Multiple Aggregates: df.pivot_table(index='group', values='value', aggfunc=['mean', 'sum', 'count'])
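A worked instance of the weighted moving average above, using the same weights on a made-up series. raw=True is an optional choice here that passes plain NumPy arrays to the lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical series and the five weights from the bullet above (they sum to 1).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
weights = np.array([0.1, 0.2, 0.3, 0.2, 0.2])

# Each output is the weighted average of the current row and the four before it.
wma = values.rolling(window=5).apply(
    lambda x: np.average(x, weights=weights), raw=True
)

# Running maximum, for comparison with the cumulative bullet above.
cumulative_max = values.cummax()
```

The first full window gives 0.1*1 + 0.2*2 + 0.3*3 + 0.2*4 + 0.2*5 = 3.2; the four rows before it are NaN because the window is not yet full. The weights list must have exactly as many entries as the window.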