Unveiling the Patterns: A Cluster Analysis of NYC Shootings
jadavvineet73
24 slides
May 10, 2024
About This Presentation
This slideshow presents a data-driven analysis of NYC shootings. By employing cluster analysis, we uncover hidden patterns within these incidents, providing insights that can aid crime prevention strategies. For more such analysis and management, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Slide Content
DATA SCIENCE PROJECT NYC Shootings Cluster Analysis
AGENDA: OBJECTIVE, DATA PREPROCESSING, DATA CLEANING & FORMATTING, DROPPING COLUMNS, FEATURE ENGINEERING, CREATING DUMMY VARIABLES, DATA SUMMARIZATION / DESCRIPTIVE STATISTICS, FINDING K VALUE, CLUSTERING MODEL DEVELOPMENT, VISUALIZATION, CONCLUSION
OBJECTIVE: THE GOAL OF THIS PROJECT IS TO DEVELOP A MACHINE LEARNING MODEL THAT CAN CLUSTER SHOOTING INCIDENTS IN NEW YORK CITY BASED ON RELEVANT ATTRIBUTES SUCH AS OCCURRENCE DATE AND TIME, LOCATION, DEMOGRAPHIC INFORMATION OF PERPETRATORS AND VICTIMS, AND JURISDICTION. BY IDENTIFYING CLUSTERS OF SIMILAR INCIDENTS, LAW ENFORCEMENT AGENCIES CAN BETTER UNDERSTAND THE UNDERLYING DYNAMICS OF GUN VIOLENCE AND TAILOR THEIR INTERVENTIONS ACCORDINGLY.
DATA PREPROCESSING: DATA PREPROCESSING INVOLVES CLEANING, FORMATTING, AND TRANSFORMING RAW DATA INTO A FORMAT MORE SUITABLE FOR ANALYSIS AND MODELING. COMMON TASKS INCLUDE HANDLING MISSING VALUES, DEALING WITH OUTLIERS, SCALING FEATURES, ENCODING CATEGORICAL VARIABLES, AND SPLITTING THE DATA INTO TRAINING AND TESTING SETS. THE GOAL IS TO MAKE THE DATA READY FOR ANALYSIS AND MODELING BY ENSURING ITS QUALITY, CONSISTENCY, AND COMPATIBILITY WITH THE MACHINE LEARNING ALGORITHMS.
DATA CLEANING & FORMATTING: FIRST, I CREATED A COPY OF THE GIVEN DATA SO THAT THE CLEANING COULD BE PERFORMED WHILE LEAVING THE ORIGINAL DATA UNALTERED. THEN I IMPORTED THE DATA AND CHECKED FOR THE REQUIRED COLUMNS. NEXT, I DROPPED THE COLUMNS THAT WERE NOT REQUIRED AND THE COLUMNS THAT CONTAINED MOSTLY BLANK VALUES. FINALLY, I CHANGED THE “NULL” AND “UNIDENTIFIED” VALUES IN THE DATA SET TO “UNKNOWN” FOR EASY IDENTIFICATION.
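The cleaning steps described on this slide can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the toy data, the column names, and the exact spellings of the null-like markers ("NULL", "(null)", "UNIDENTIFIED") are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BROOKLYN"],
    "PERP_RACE": ["(null)", "BLACK", np.nan],
    "VIC_SEX": ["M", "UNIDENTIFIED", "F"],
})

# Work on a copy so the original data stays unaltered.
df = raw.copy()

# Map null-like markers (assumed spellings) and true missing values to "UNKNOWN".
placeholders = ["NULL", "(null)", "UNIDENTIFIED"]
df = df.replace(placeholders, "UNKNOWN").fillna("UNKNOWN")

print(df)
```

Using a single label for every kind of missing or unidentified value keeps each category column to one "unknown" level when it is later encoded.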
DROPPING COLUMNS: THIS SLIDE LISTS THE COLUMNS THAT I DROPPED. THEY WERE DROPPED BECAUSE THEY CONTAINED EITHER TOO LITTLE DATA OR UNWANTED DATA.
DROPPING COLUMNS: DROPPING COLUMNS REFERS TO THE PROCESS OF REMOVING CERTAIN COLUMNS OR VARIABLES FROM A DATASET. THIS IS OFTEN DONE DURING THE DATA PREPROCESSING PHASE WHEN SOME COLUMNS ARE DEEMED UNNECESSARY OR REDUNDANT FOR THE ANALYSIS OR MODELING TASK AT HAND. HERE ARE SOME COMMON SCENARIOS WHERE DROPPING COLUMNS MIGHT BE NECESSARY:
IRRELEVANT FEATURES: SOME COLUMNS MAY NOT CONTRIBUTE RELEVANT INFORMATION TO THE ANALYSIS OR PREDICTION TASK.
HIGHLY CORRELATED FEATURES: IF TWO OR MORE COLUMNS ARE HIGHLY CORRELATED, MEANING THEY CONTAIN SIMILAR INFORMATION, DROPPING ONE OF THEM CAN REDUCE REDUNDANCY AND MULTICOLLINEARITY IN THE DATASET. THIS CAN IMPROVE THE STABILITY AND INTERPRETABILITY OF THE MODELS.
MISSING VALUES: IF A COLUMN HAS A HIGH PERCENTAGE OF MISSING VALUES AND IMPUTATION ISN'T FEASIBLE OR APPROPRIATE, DROPPING THE COLUMN MIGHT BE NECESSARY TO MAINTAIN THE INTEGRITY OF THE DATASET.
DATA LEAKAGE: COLUMNS THAT CONTAIN INFORMATION ABOUT THE TARGET VARIABLE OR ARE DERIVED FROM THE TARGET VARIABLE SHOULD BE REMOVED TO PREVENT DATA LEAKAGE, WHICH COULD ARTIFICIALLY INFLATE THE MODEL'S PERFORMANCE DURING TRAINING.
COMPUTATIONAL EFFICIENCY: DATASETS WITH A LARGE NUMBER OF COLUMNS CAN BE COMPUTATIONALLY EXPENSIVE TO PROCESS AND TRAIN MODELS ON. DROPPING IRRELEVANT OR REDUNDANT COLUMNS CAN HELP REDUCE THE DIMENSIONALITY OF THE DATASET AND IMPROVE COMPUTATIONAL EFFICIENCY.
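Two of the scenarios above (mostly missing columns and highly correlated columns) lend themselves to a short sketch. The helper names `drop_sparse_columns` and `drop_correlated_columns`, the thresholds, and the toy data are hypothetical and chosen only for illustration; the original slides do not state which rule or threshold was used.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing (assumed threshold)."""
    missing_share = df.isna().mean()
    to_drop = missing_share[missing_share > max_missing].index
    return df.drop(columns=to_drop)

def drop_correlated_columns(df, threshold=0.95):
    """For each highly correlated numeric pair, drop the later column (assumed threshold)."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: "b" duplicates the information in "a", and "d" is 75% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
    "d": [np.nan, np.nan, np.nan, 1.0],
})

reduced = drop_correlated_columns(drop_sparse_columns(df))
print(list(reduced.columns))
```

Both thresholds are judgment calls; in practice they would be tuned after inspecting the missing-value shares and the correlation matrix.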
FEATURE ENGINEERING: FEATURE ENGINEERING FOCUSES ON CREATING NEW FEATURES OR MODIFYING EXISTING ONES TO IMPROVE THE PERFORMANCE OF MACHINE LEARNING MODELS. THIS PROCESS INVOLVES SELECTING, TRANSFORMING, OR COMBINING FEATURES TO EXTRACT USEFUL INFORMATION AND REPRESENT THE DATA MORE EFFECTIVELY. FEATURE ENGINEERING TECHNIQUES INCLUDE CREATING POLYNOMIAL FEATURES, BINNING, DISCRETIZATION, DIMENSIONALITY REDUCTION (E.G., PCA), FEATURE SCALING, AND CREATING INTERACTION TERMS. THE GOAL OF FEATURE ENGINEERING IS TO ENHANCE THE PREDICTIVE POWER OF THE MODEL BY PROVIDING IT WITH MORE INFORMATIVE AND DISCRIMINATIVE FEATURES, ULTIMATELY IMPROVING ITS ACCURACY AND GENERALIZATION ABILITY.
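As a small illustration of creating new features and binning, the sketch below derives year, month, and hour from date and time strings and then bins the hour into parts of the day. The input column names `OCCUR_DATE` and `OCCUR_TIME`, their formats, and the bin edges are assumptions; the slides only confirm that a column called OCCUR_YEAR is used later for plotting.

```python
import pandas as pd

# Toy data; column names and date/time formats are assumptions.
df = pd.DataFrame({
    "OCCUR_DATE": ["01/15/2021", "07/04/2022"],
    "OCCUR_TIME": ["23:10:00", "14:30:00"],
})

# Derive calendar features from the date string.
dates = pd.to_datetime(df["OCCUR_DATE"], format="%m/%d/%Y")
df["OCCUR_YEAR"] = dates.dt.year
df["OCCUR_MONTH"] = dates.dt.month

# Derive the hour of day from the time string.
df["OCCUR_HOUR"] = pd.to_datetime(df["OCCUR_TIME"], format="%H:%M:%S").dt.hour

# Binning: group hours into four left-closed intervals, e.g. [18, 24) is "evening".
df["TIME_OF_DAY"] = pd.cut(
    df["OCCUR_HOUR"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

print(df)
```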
CREATING DUMMY VARIABLES: CREATING DUMMY VARIABLES REFERS TO THE PROCESS OF CONVERTING CATEGORICAL VARIABLES INTO A SET OF BINARY VARIABLES, ALSO KNOWN AS DUMMIES, THAT REPRESENT THE DIFFERENT CATEGORIES OR LEVELS OF THE ORIGINAL VARIABLE. IN SUMMARY, CREATING DUMMY VARIABLES IS A TECHNIQUE USED TO ENCODE CATEGORICAL VARIABLES INTO A FORMAT THAT CAN BE UTILIZED BY MACHINE LEARNING ALGORITHMS.
CREATING DUMMY VARIABLES: USING THE CODE SHOWN ON THIS SLIDE, I CREATED DUMMY VARIABLES FOR CERTAIN COLUMNS. BECAUSE THESE COLUMNS PLAY A MAJOR ROLE IN DEVELOPING THE MODEL, THEY SHOULD NOT BE DROPPED, BUT THEY CANNOT REMAIN IN STRING FORMAT EITHER. DUMMY VARIABLES ARE THEREFORE CREATED.
CREATING DUMMY VARIABLES: IN THE COLUMN “BORO” THE VALUES ARE STRINGS. SINCE THEY MUST BE IN NUMERICAL FORMAT, DUMMY VARIABLES ARE CREATED: EACH DISTINCT STRING VALUE BECOMES ITS OWN COLUMN, AND THE VALUES IN THAT COLUMN ARE 1 OR 0 DEPENDING ON WHETHER THE ROW BELONGS TO THAT CATEGORY (TRUE) OR NOT (FALSE).
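A minimal sketch of this encoding with pandas follows. The toy rows are invented, and the use of `pd.get_dummies` is an assumption about how the dummies were produced; the slide's own code is not reproduced in the text.

```python
import pandas as pd

# Toy data; only the column name "BORO" comes from the slides.
df = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BRONX"],
    "OCCUR_YEAR": [2021, 2022, 2022],
})

# Each distinct BORO value becomes its own 0/1 column, prefixed with "BORO_".
encoded = pd.get_dummies(df, columns=["BORO"], dtype=int)

print(encoded)
```

The original string column is removed and replaced by one indicator column per category, so every remaining column is numeric and can be passed to a clustering algorithm.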
DATA SUMMARIZATION / DESCRIPTIVE STATISTICS: DATA SUMMARIZATION IS THE PROCESS OF CONDENSING AND PRESENTING KEY CHARACTERISTICS OR INSIGHTS FROM A DATASET. IT INVOLVES VARIOUS TECHNIQUES FOR SUMMARIZING AND ANALYZING DATA TO GAIN A BETTER UNDERSTANDING OF ITS STRUCTURE, PATTERNS, AND RELATIONSHIPS. IN PYTHON, THE describe() METHOD OF A PANDAS DATA FRAME, CALLED AS df.describe(), IS COMMONLY USED TO GENERATE DESCRIPTIVE STATISTICS. BY DEFAULT IT PROVIDES SUMMARY STATISTICS FOR THE NUMERICAL COLUMNS IN THE DATA FRAME: COUNT, MEAN, STANDARD DEVIATION, MINIMUM, MAXIMUM, AND QUARTILE VALUES. USING df.describe() IS A QUICK WAY TO GET AN OVERVIEW OF THE DISTRIBUTION AND CENTRAL TENDENCY OF NUMERICAL DATA. IT HELPS IN UNDERSTANDING THE RANGE OF VALUES, THE POSSIBLE PRESENCE OF OUTLIERS, AND THE OVERALL SHAPE OF THE DATA.
DATA SUMMARIZATION / DESCRIPTIVE STATISTICS: HERE'S WHAT EACH STATISTIC REPRESENTS:
COUNT: NUMBER OF NON-NULL VALUES IN EACH COLUMN.
MEAN: AVERAGE VALUE OF EACH COLUMN.
STD: STANDARD DEVIATION, A MEASURE OF THE DISPERSION OF VALUES AROUND THE MEAN.
MIN: MINIMUM VALUE IN EACH COLUMN.
25%: FIRST QUARTILE, OR 25TH PERCENTILE.
50%: MEDIAN, OR 50TH PERCENTILE.
75%: THIRD QUARTILE, OR 75TH PERCENTILE.
MAX: MAXIMUM VALUE IN EACH COLUMN.
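The statistics listed above can be seen on a toy data frame. The data below is invented for illustration; only the `describe()` call itself reflects what the slides describe.

```python
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "OCCUR_YEAR": [2020, 2021, 2022, 2023],
    "BORO": ["BRONX", "QUEENS", "BRONX", "BROOKLYN"],
})

# With mixed column types, describe() summarizes only the numeric columns by default.
summary = df.describe()

print(summary)
```

The string column "BORO" is left out of the result; passing `include="all"` to `describe()` would add count, unique, top, and freq rows for it.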
FINDING K VALUE:
IMPORT LIBRARIES: "from sklearn.cluster import KMeans" THIS LINE IMPORTS THE KMEANS CLUSTERING ALGORITHM FROM THE SCIKIT-LEARN LIBRARY, A WIDELY USED MACHINE LEARNING LIBRARY IN PYTHON.
INITIALIZE AN EMPTY LIST: "wcss = []" THIS LINE INITIALIZES AN EMPTY LIST CALLED wcss, WHICH STORES THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) FOR EACH VALUE OF K.
LOOP OVER K VALUES: THE LOOP ITERATES OVER VALUES OF K FROM 1 TO 10.
INSTANTIATE KMEANS MODEL: "kmeans = KMeans(n_clusters=k, init='k-means++')" INSIDE THE LOOP, A KMEANS MODEL IS INSTANTIATED WITH n_clusters=k, WHERE k IS THE CURRENT VALUE OF K. THE init ARGUMENT SPECIFIES THE INITIALIZATION METHOD FOR THE CENTROIDS, "k-means++", WHICH CHOOSES INITIAL CLUSTER CENTROIDS IN A WAY THAT TYPICALLY SPEEDS UP CONVERGENCE.
FINDING K VALUE:
FIT KMEANS MODEL: THE KMEANS MODEL IS FITTED TO THE DATA USING THE fit METHOD. THE DATA USED FOR CLUSTERING IS OBTAINED FROM THE DATAFRAME df BY EXCLUDING THE FIRST COLUMN. THIS ASSUMES THAT THE FIRST COLUMN CONTAINS LABELS OR IDENTIFIERS AND THE REMAINING COLUMNS ARE FEATURES USED FOR CLUSTERING.
COMPUTE WCSS: THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) IS COMPUTED. WCSS IS THE SUM OF SQUARED DISTANCES OF SAMPLES TO THEIR CLOSEST CLUSTER CENTER; IN SCIKIT-LEARN THE FITTED MODEL EXPOSES IT AS THE inertia_ ATTRIBUTE.
PLOT: AFTER COMPUTING WCSS FOR ALL VALUES OF K, A LINE PLOT IS CREATED. THE X-AXIS REPRESENTS THE VALUES OF K (FROM 1 TO 10), AND THE Y-AXIS REPRESENTS THE CORRESPONDING WCSS VALUES. THE PLOT VISUALIZES THE RELATIONSHIP BETWEEN K AND WCSS, AND K IS TYPICALLY CHOSEN AT THE "ELBOW" OF THE CURVE, WHERE ADDING MORE CLUSTERS YIELDS ONLY A SMALL FURTHER REDUCTION IN WCSS. FINALLY, THE PLOT IS DISPLAYED.
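The steps on these two slides can be put together in a short sketch. The synthetic data, the `n_init` and `random_state` settings, and the output file name `elbow.png` are assumptions added so the example is reproducible; the slides confirm only the import, the `wcss` list, the loop over k from 1 to 10, `init="k-means++"`, and the exclusion of the first column.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data: three blobs plus an identifier in the first column.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
df = pd.DataFrame({"id": range(len(points)), "x": points[:, 0], "y": points[:, 1]})

# Exclude the first column (identifier) and cluster on the remaining features.
X = df.iloc[:, 1:]

wcss = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares for this k

# Elbow plot: k on the x-axis, WCSS on the y-axis.
fig, ax = plt.subplots()
ax.plot(list(k_values), wcss, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("WCSS")
ax.set_title("Elbow method")
fig.savefig("elbow.png")
```

In an interactive session, `plt.show()` can replace the `savefig` call and the backend line can be dropped.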
CLUSTERING MODEL DEVELOPMENT:
1. KMEANS CLUSTERING: A KMEANS CLUSTERING MODEL IS INITIALIZED WITH 2 CLUSTERS. THE fit_predict METHOD IS THEN USED TO BOTH FIT THE MODEL TO THE DATA AND PREDICT THE CLUSTER LABEL FOR EACH DATA POINT. THE CLUSTER LABELS ARE ASSIGNED TO THE DATAFRAME df AS A NEW COLUMN NAMED "LABEL".
2. 3D SCATTER PLOT VISUALIZATION: A NEW FIGURE IS CREATED WITH A SPECIFIED SIZE, AND A 3D SUBPLOT IS CREATED WITHIN THE FIGURE. A SCATTER PLOT IS THEN DRAWN FOR EACH CLUSTER LABEL.
CLUSTERING MODEL DEVELOPMENT: FIRST, A SCATTER PLOT IS CREATED FOR THE DATA POINTS BELONGING TO CLUSTER LABEL 0. THE X, Y, AND Z COORDINATES ARE OCCUR_YEAR, LONGITUDE, AND LATITUDE, RESPECTIVELY, AND THE POINTS OF THIS CLUSTER ARE PLOTTED IN BLUE. SIMILAR SCATTER PLOTS ARE CREATED FOR THE OTHER CLUSTER LABELS (E.G., CLUSTER 1) IN DIFFERENT COLORS (E.G., RED). THE VIEW ANGLE OF THE 3D PLOT IS THEN ADJUSTED, AND A LEGEND SHOWING THE CLUSTER LABELS IS DISPLAYED. THIS COMPLETES THE 3D SCATTER PLOT.
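The clustering and 3D plotting steps described above can be sketched as follows. The synthetic data, the figure size, the view angles, the `n_init` and `random_state` settings, and the output file name `clusters_3d.png` are assumptions; the slides confirm 2 clusters, `fit_predict`, the "LABEL" column, the three axes, and the blue and red colors.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data; only the column names come from the slides.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "OCCUR_YEAR": rng.integers(2006, 2023, size=200),
    "LONGITUDE": rng.normal(-73.9, 0.05, size=200),
    "LATITUDE": rng.normal(40.7, 0.05, size=200),
})

# Fit KMeans with 2 clusters and store each point's cluster label in a new column.
features = df[["OCCUR_YEAR", "LONGITUDE", "LATITUDE"]]
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
df["LABEL"] = kmeans.fit_predict(features)

# 3D scatter plot: one scatter call per cluster label, each in its own color.
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
for label, color in [(0, "blue"), (1, "red")]:
    cluster = df[df["LABEL"] == label]
    ax.scatter(cluster["OCCUR_YEAR"], cluster["LONGITUDE"], cluster["LATITUDE"],
               c=color, label=f"Cluster {label}")
ax.set_xlabel("OCCUR_YEAR")
ax.set_ylabel("LONGITUDE")
ax.set_zlabel("LATITUDE")
ax.view_init(elev=20, azim=45)  # adjust the view angle
ax.legend()
fig.savefig("clusters_3d.png")
```

Because OCCUR_YEAR spans a much larger numeric range than the coordinates, it dominates the distance computation in this unscaled sketch. The slides do not say whether the features were scaled before clustering, so scaling (for example with scikit-learn's `StandardScaler`) is worth considering.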
VISUALIZATION: DATA EXPLORATION AND PREPROCESSING INVOLVE VISUALIZATION TECHNIQUES SUCH AS HISTOGRAMS, SCATTER PLOTS, BOX PLOTS, AND HEATMAPS. THESE VISUALIZATIONS AID IN UNDERSTANDING THE DISTRIBUTION, RELATIONSHIPS, AND POTENTIAL OUTLIERS IN THE DATA. THEY ARE CRUCIAL FOR MAKING DECISIONS ABOUT PREPROCESSING STEPS SUCH AS FEATURE SCALING, OUTLIER REMOVAL, AND FEATURE ENGINEERING. IN THIS PROJECT I VISUALIZED THE GIVEN DATA IN POWER BI. THE POWER BI REPORT IS AVAILABLE AT THE FOLLOWING LINK: https://app.powerbi.com/view?r=eyJrIjoiYzU2MGMzMGEtZGYwZS00MDY2LWI0YTItOTI4MGY2ZGNhNWI0IiwidCI6IjUzODhhOWI3LWUzOWQtNDZhMS1hZDQ5LTRiMjMwMjg5MzYzYiJ9
CONCLUSION: A CLUSTERING MODEL WAS DEVELOPED, AND THE FOLLOWING STEPS PLAYED A SIGNIFICANT ROLE IN DEVELOPING IT: DATA PREPROCESSING, FEATURE ENGINEERING, CLUSTERING MODEL DEVELOPMENT, AND VISUALIZATION.