FAKE JOB PREDICTION USING MACHINE LEARNING Presented By A. Rupasri (20NE1A0510) SK. Rehamunnisha (20NE1A0539) D. Sai Supriya (20NE1A0542) SK. Mohammad Fahim (20NE1A0552) Under The Esteemed Guidance Of Mrs. J. Lakshmi B. Tech, M. Tech Assoc. Professor
CONTENTS Introduction Abstract Existing System and Proposed System System Architecture Design Implementation code Output Screens
INTRODUCTION
Employment scams are a serious and growing problem within the domain of Online Recruitment Fraud. Many companies now post their vacancies online so that job-seekers can find them easily and promptly. Fraudsters exploit this channel: they advertise jobs in order to extract money from applicants, and they may post fraudulent advertisements under a reputed company's name, damaging its credibility. Detecting fraudulent job posts has therefore attracted attention as a way to build an automated tool that identifies fake jobs and warns people before they apply. For this purpose, a machine learning approach is applied that employs several classification algorithms to recognize fake posts. A classification tool isolates fake job posts from a larger set of job advertisements and alerts the user. To identify scams in job postings, supervised learning algorithms are considered first as classification techniques. A classifier maps input variables to target classes by learning from training data. The classifiers addressed in the paper for identifying fake job posts are described briefly. Classifier-based prediction may be broadly categorized into single-classifier prediction and ensemble-classifier prediction.
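The difference between single-classifier and ensemble-classifier prediction can be illustrated with a small sketch. The three keyword rules below are hypothetical stand-ins for trained models, not the classifiers used in the project; the point is only that an ensemble combines several votes while a single classifier decides alone.

```python
from collections import Counter

# Three deliberately simple, hypothetical rule-based classifiers.
# Each returns 1 (fake) or 0 (real) for a job posting text.
def asks_for_money(text):
    return int(any(w in text.lower() for w in ("fee", "deposit", "payment")))

def promises_easy_income(text):
    return int(any(p in text.lower() for p in ("earn", "quick money")))

def lacks_company_details(text):
    return int("company" not in text.lower())

def single_prediction(classifier, text):
    """Single-classifier prediction: one model decides alone."""
    return classifier(text)

def ensemble_prediction(classifiers, text):
    """Ensemble prediction: majority vote over several classifiers."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]

classifiers = [asks_for_money, promises_easy_income, lacks_company_details]

scam_post = "Earn quick money from home. Pay a small registration fee to start."
mixed_post = "Software company seeks intern; learn and earn on the job."

print(ensemble_prediction(classifiers, scam_post))             # all three rules vote fake
print(single_prediction(promises_easy_income, mixed_post))     # one rule alone flags the post
print(ensemble_prediction(classifiers, mixed_post))            # the majority overrules it
```

On the mixed post a single rule raises a false alarm, while the majority vote does not. Random Forest applies the same voting idea to many decision trees.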
ABSTRACT
With the number of data and privacy breaches increasing day by day, staying safe online has become extremely difficult. The number of victims of fake job postings is rising sharply. Companies and fraudsters tempt job-seekers by various methods, the majority of which come through digital job-listing websites. Our target is to reduce such frauds by using Machine Learning to predict the chance that a job is fake, so that the candidate can stay alert and make informed decisions. The model uses NLP (Natural Language Processing) to analyze the sentiment and patterns in a job posting. It is trained as a Sequential neural network using GloVe, a popular unsupervised learning algorithm for obtaining vector representations of words. To assess its accuracy in the real world, we use the trained model to predict the status of posted jobs. We then improve the model through various methods to make it robust and realistic.
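As a rough illustration of what "vector representations for words" means, the sketch below averages word vectors to turn a posting into a fixed-length numeric input. The 3-dimensional vectors are made up for the example (real GloVe vectors have 50 or more dimensions and are loaded from a pretrained file), and averaging is only one simple way to use them; it is not claimed to be how the project's network consumes the embeddings.

```python
# Hypothetical 3-dimensional word vectors standing in for GloVe embeddings
toy_vectors = {
    "earn": [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "engineer": [0.1, 0.9, 0.3],
    "python": [0.0, 0.8, 0.5],
}

def embed(text, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

fake_like = embed("Earn money fast", toy_vectors)
real_like = embed("Python engineer wanted", toy_vectors)
print(fake_like)
print(real_like)
```

Postings with similar vocabulary end up close together in this vector space, which is what lets a downstream classifier separate them.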
EXISTING SYSTEM
The existing fake-or-real job prediction system takes all job details into account, but those records are not used efficiently for prediction. The proposed system is introduced to maintain the records in an efficient, error-free manner.
Disadvantages
- Does not generate accurate and efficient results
- Computation time is very high
- Job details are difficult to maintain
- Lack of accuracy may prevent efficient follow-up action
PROPOSED SYSTEM
We propose a system that predicts whether a job is fake or real based on attributes such as location, salary_range, department and so on. A decision system is needed that makes it easier to assess a job posting and offers a prediction about the job so that further steps can be taken effectively. The proposed system aims not only to predict fake jobs accurately but also to reduce prediction time. Machine learning algorithms such as Decision Tree, Random Forest, Naive Bayes and K Nearest Neighbours have proven accurate and reliable, and hence are used in this project.
Advantages
- Generates accurate and efficient results
- Computation time is greatly reduced
- Easy maintenance of employee details
- Reduces manual work
- Automated prediction
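The four algorithms named above can be tried side by side with scikit-learn. This is a minimal sketch under stated assumptions: the eight postings are made up, bag-of-words counts are used as features, and the scores are training accuracy only, so they say nothing about how the models rank on the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy postings (made up for illustration): 1 = fake, 0 = real
texts = [
    "Earn money fast from home, pay a registration fee",
    "Send bank details to receive your first salary",
    "Quick cash for typing work, deposit required",
    "No interview, unlimited income, join today",
    "Software engineer with Python experience needed",
    "Accountant for audit firm, finance degree required",
    "Marketing manager to lead retail campaigns",
    "Nurse for city hospital, valid licence required",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Naive Bayes": MultinomialNB(),
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
}

# Training accuracy only; a real comparison needs a held-out test set
scores = {name: model.fit(X, labels).score(X, labels) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On the real data, the same loop would be run after a train/test split, and metrics such as precision and recall on the fraudulent class would be more informative than accuracy.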
SYSTEM ARCHITECTURE
The project aims to detect phoney jobs so that users do not fall victim to scams. This gives assurance that the data they provide at the time of recruitment will not be misused. We work on the EMSCAD dataset to find better results using different algorithms. The fake job post dataset is collected and preprocessed. Feature selection is the process of selecting the important features from the data that are required for analysis and for producing a proper output. We apply the Random Forest classifier to detect whether a posted job is fake or legitimate.
Fig: System architecture
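The stages described above (preprocessing, feature selection, Random Forest classification) can be chained in a scikit-learn pipeline. The sketch below is one possible arrangement, not the project's actual architecture: the TF-IDF features, the chi-squared selector, the value k=10 and the six postings are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Made-up postings for illustration: 1 = fake, 0 = legitimate
texts = [
    "Earn money fast from home, pay a registration fee today",
    "Software engineer needed with Python and SQL experience",
    "No interview required, send bank details to receive salary",
    "Accountant position at audit firm, degree in finance required",
    "Quick cash for simple typing work, deposit fee to start",
    "Marketing manager to lead campaigns for retail company",
]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # preprocessing: text -> numeric features
    ("select", SelectKBest(chi2, k=10)),   # feature selection: keep the 10 best terms
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(texts, labels)

new_post = ["Pay a small fee and earn cash from home"]
prediction = pipeline.predict(new_post)
print("fake" if prediction[0] == 1 else "legitimate")
```

Keeping all stages in one pipeline object means the same preprocessing and feature selection are applied automatically at prediction time.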
DATA SET DETAILS
Fake Job Description Prediction dataset [Real or Fake]: out of the 18K job descriptions in this dataset, around 800 are fraudulent. The data consists of both textual information and job-related metadata. The dataset can be used to build classification models that identify fraudulent job descriptions. The tasks are to:
- Create a classification model that uses the text features and meta-features to distinguish genuine from fake job descriptions.
- Determine the key characteristics (words, entities, or phrases) of job descriptions that are fraudulent in nature.
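The class balance quoted above has a practical consequence that a short calculation makes visible. The figures 18,000 and 800 are the rounded numbers from the slide, so the results are approximate; the weighting formula is the common "balanced" heuristic (weight inversely proportional to class frequency) and is shown as an option, not as something the project is known to use.

```python
total_posts = 18_000   # approximate dataset size from the slide
fake_posts = 800       # approximate number of fraudulent posts

fake_share = fake_posts / total_posts
baseline_accuracy = 1 - fake_share  # accuracy of always predicting "real"

def balanced_class_weight(n_samples, n_class_samples, n_classes=2):
    """Weight inversely proportional to class frequency."""
    return n_samples / (n_classes * n_class_samples)

weight_fake = balanced_class_weight(total_posts, fake_posts)
weight_real = balanced_class_weight(total_posts, total_posts - fake_posts)

print(f"Share of fake posts:      {fake_share:.2%}")
print(f"Always-'real' accuracy:   {baseline_accuracy:.2%}")
print(f"Balanced weight for fake: {weight_fake:.2f}")
print(f"Balanced weight for real: {weight_real:.2f}")
```

Because a model that never flags anything already scores about 95.6% accuracy, accuracy alone is a weak measure here; recall and precision on the fraudulent class are more telling.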
Fig 1: Heatmap representing the missing values in the dataset
Fig: Heatmap after filling the missing values
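The information shown in the two heatmaps can also be checked numerically. The sketch below uses a tiny made-up frame whose column names follow the dataset; the values themselves are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the dataset
toy = pd.DataFrame({
    "title": ["Data Analyst", "Home Typist", "Engineer", "Clerk"],
    "salary_range": [np.nan, np.nan, "40000-50000", np.nan],
    "department": ["IT", np.nan, np.nan, "Admin"],
})

missing_before = toy.isnull().mean()          # fraction of missing values per column
filled = toy.fillna(" ")                      # same fill value as the sample code
missing_after = filled.isnull().sum().sum()   # total missing cells after filling

print(missing_before)
print("Missing cells after filling:", missing_after)
```

Passing `toy.isnull()` to seaborn's `heatmap` function draws the same boolean mask as an image, which is the usual way such a missing-value heatmap is produced.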
DESIGN: UML DIAGRAMS
The Unified Modeling Language allows the software engineer to express an analysis model using a modeling notation that is governed by a set of syntactic, semantic and pragmatic rules. A UML system is represented using five different views that describe the system from distinctly different perspectives. Each view is defined by a set of diagrams, as follows.
User Model View: This view represents the system from the user's perspective.
Structural Model View: In this model the data and functionality are viewed from inside the system.
Behavioral Model View: It represents the dynamic or behavioral parts of the system, depicting the interactions between the various structural elements described in the user model and structural model views.
Implementation Model View: In this view the structural and behavioral parts of the system are represented as they are to be built.
Environmental Model View: In this view the structural and behavioral aspects of the environment in which the system is to be implemented are represented.
Class Diagram
The class diagram is the main building block of object-oriented modeling. It is used both for general conceptual modeling of the systematics of the application and for detailed modeling that translates the models into programming code. Class diagrams can also be used for data modeling. The classes in a class diagram represent both the main objects and interactions in the application and the classes to be programmed. In the diagram, a class is represented by a box with three parts:
- The upper part holds the name of the class
- The middle part contains the attributes of the class
- The bottom part gives the methods or operations the class can take or undertake
Use Case Diagram
A use case diagram, at its simplest, is a representation of a user's interaction with the system that depicts the specifications of a use case. It can portray the different types of users of a system and the various ways in which they interact with it. This type of diagram is typically used in conjunction with the textual use case and is often accompanied by other types of diagrams as well.
Sequence Diagram A Sequence Diagram is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. A sequence diagram shows object interactions arranged in time sequence. It depicts the objects and classes involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario. Sequence diagrams are typically associated with use case realizations in the Logical View of the system under development. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.
IMPLEMENTATION
Sample code
LOAD DATASET:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string

# Reading dataset
# Dataset is from https://www.kaggle.com/amruthjithrajvr/recruitment-scam
# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r"C:\Program Files (x86)\Google\fake_job_postings.csv")

# Reading top 5 rows of our dataset
data.head()
# To check the number of rows and columns
data.shape
# Output: (17880, 18)

data.columns
# Output:
# Index(['job_id', 'title', 'location', 'department', 'salary_range',
#        'company_profile', 'description', 'requirements', 'benefits',
#        'telecommuting', 'has_company_logo', 'has_questions',
#        'employment_type', 'required_experience', 'required_education',
#        'industry', 'function', 'fraudulent'],
#       dtype='object')

# Let us check the missing values in our dataset
data.isnull().sum()
# Remove the columns which are not necessary
# axis=1 selects columns; inplace=True makes the drop permanent in the dataset
# salary_range is dropped because approx. 70% of its values are null
# job_id and the other irrelevant columns are dropped because they carry no logical meaning
data.drop(['job_id', 'salary_range', 'telecommuting', 'has_company_logo', 'has_questions'],
          axis=1, inplace=True)
data.shape
data.head()

# Fill NaN values with a blank space
# inplace=True makes this change permanent in the dataset
data.fillna(' ', inplace=True)
# Checking the distribution of the class label
# (percentage belonging to the real class and percentage belonging to the fraud class)
# In the data, 1 indicates a fraud post and 0 indicates a real post
# Plotting a pie chart for the data
labels = 'Fake', 'Real'
sizes = [data.fraudulent[data['fraudulent'] == 1].count(),
         data.fraudulent[data['fraudulent'] == 0].count()]
explode = (0, 0.1)
fig1, ax1 = plt.subplots(figsize=(8, 6))  # size of the pie chart
# autopct '%1.2f%%' prints percentages with 2-digit precision
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.2f%%',
        shadow=True, startangle=120)
ax1.axis('equal')
plt.title("Proportion of Fraudulent", size=7)
plt.show()
# Visualizing jobs based on experience
experience = dict(data.required_experience.value_counts())
del experience[' ']  # drop the blank entry created by fillna(' ')
plt.figure(figsize=(12, 9))
plt.bar(list(experience.keys()), list(experience.values()))
plt.title('No. of Jobs with Experience')
plt.xlabel('Experience', size=10)
plt.ylabel('No. of jobs', size=10)
plt.xticks(rotation=35)
plt.show()

# vect (fitted vectorizer), dt (trained classifier) and input_text
# come from training steps that are not part of this sample
# Convert text to feature vectors
input_data_features = vect.transform(input_text)

# Making prediction
prediction = dt.predict(input_data_features)
print(prediction)
if prediction[0] == 1:
    print('Fraudulent Job')
else:
    print('Real Job')
# Output:
# [0]
# Real Job
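The prediction step in the sample code relies on a vectorizer `vect` and a classifier `dt` whose training is not shown. The sketch below is one way those two objects could be produced, assuming a TF-IDF vectorizer and a decision tree; the training texts are made up, and the project's real training code may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up training postings for illustration: 1 = fraudulent, 0 = real
train_texts = [
    "Earn money fast from home, pay a registration fee",
    "Send bank details to receive your first salary",
    "Quick cash for typing work, deposit required",
    "Software engineer with Python experience needed",
    "Accountant for audit firm, finance degree required",
    "Marketing manager to lead retail campaigns",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Fit the vectorizer on the training text, then train the tree on the vectors
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, train_labels)

# The prediction step then works as in the sample code
input_text = ["Work from home and earn money, pay a small fee to register"]
input_data_features = vect.transform(input_text)
prediction = dt.predict(input_data_features)
print(prediction)
print("Fraudulent Job" if prediction[0] == 1 else "Real Job")
```

The vectorizer must be fitted once on the training text and only `transform` called on new input, so that new postings are mapped onto the same vocabulary the classifier was trained on.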
OUTPUT SCREENS
Fig: Output Screen For Home Page
DISPLAYING THE FAKE OR REAL
Fig: Common words in real job posting
Fig: Common words in fake job posting