Phishing Link Checker Using a Hybrid Approach
About This Presentation
This presentation describes a new methodology for phishing link detection and introduces a hybrid detection approach.
Slide Content
Phishing Link Checker
Sarvesh Yashwant Redekar, Megh Harshad Malvankar, Miraj Raghunath Wadekar, Shreyash Subhash Chaware
Project Guide: Prof. S. S. Kadam ([email protected])
Department of Computer Science and Engineering (AIML), Sindhudurg Shikshan Prasarak Mandal's College of Engineering, Kankavli.
Outline
Introduction Phishing websites are one of the most common and dangerous threats on the internet. They impersonate legitimate platforms—such as banking portals, e-commerce sites, or social media pages. Because attackers constantly evolve their tactics, relying on manual detection or blacklists alone is not practical. Automated and adaptive solutions are essential to stay ahead.
In this work, we present a hybrid approach to phishing website detection. On the client side, we deploy lightweight machine learning models that can quickly flag suspicious sites in real time, ensuring minimal delays for users. When the system encounters uncertain or ambiguous cases, it escalates the analysis to the server side, where a large language model performs deeper semantic and contextual checks. This layered design balances speed with accuracy, offering a scalable and effective defense against phishing attempts.
Literature Review Existing phishing detection research falls into two categories. On one side, ML-based approaches like XGBoost and Random Forest give good accuracy and speed, but they rely mainly on URL or metadata features and miss deeper semantic or visual cues. On the other side, LLM-based and multimodal systems can analyze text, HTML, and even screenshots with high accuracy, but they are computationally expensive and not practical for real-time protection in a browser. None of these works focus on direct end-user deployment. Our project addresses this gap by combining the strengths of both: a lightweight ML model runs locally in the Chrome extension for instant decisions, and when the model is unsure, an LLM provides deeper semantic analysis via the backend. This hybrid design ensures speed, accuracy, and explainability while working in real-time for everyday users.
Literature Review

[1] PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection (2025), Wenhao Li, Selvakumar Manickam, Yung-Wey Chong, Shankar Karuppayah.
Methodology: Proposes a modular multi-agent system using LLMs with a debate mechanism; agents specialize in different aspects (URL, HTML, content semantics, brand); achieves high accuracy and interpretability.
Research gaps: Focused only on server-side frameworks; no practical lightweight deployment (e.g., browser extensions). High computation cost; lacks a real-time blocking mechanism for end users.

[2] An assessment framework for explainable AI with applications to cybersecurity (2025), Maria Carla Calzarossa, Paolo Giudici, Rasha Zieni.
Methodology: Proposed a comparative framework for explainability using Shapley and Lorenz Zonoid-based methods; applied to phishing detection with 48 web page features.
Research gaps: Strong on explainability but limited in scalability and real-world integration. No client-side or browser implementation; explainability methods not connected to user-facing alerts.

[3] PhishDebate: Multi-Agent LLM Framework (2025), W. Li et al. (arXiv).
Methodology: Multi-agent LLM debate (URL, HTML, semantics, brand); real-world phishing webpages (collected).
Research gaps: Uses collected datasets, but no demonstration of real-time protection. Heavy reliance on the LLM makes it unsuitable for lightweight Chrome extensions; missing a hybrid approach with fast ML + fallback LLM.

[4] PhishEmailLLM: Meta-model (2025), Nair et al. (ACM).
Methodology: Meta-model combining LLM signals + classical detectors; Enron, Nazario, and curated corpora.
Research gaps: Focuses on email phishing, not websites. Lacks browser integration and real-time URL interception. Needs adaptation for web phishing + Chrome extension deployment.
Literature Review

[5] Evolution of Phishing Detection with AI (2025), various authors (arXiv).
Methodology: Comparative evaluation of ML, DL, and quantized LLMs; public benchmarks.
Research gaps: Focuses only on benchmark comparisons; no real-time system design. Lacks deployment in lightweight environments like browser extensions.

[6] Enhancing Phishing Detection through Explainable AI (2025), Eilertsen et al. (arXiv).
Methodology: LLM-assisted explainability + classical ML; public phishing datasets.
Research gaps: Does not address real-time blocking or integration into user-facing tools such as Chrome extensions.

[7] KnowPhish: LLM + Multimodal KG (2024), Li et al. (USENIX Security '24).
Methodology: Multimodal LLM + knowledge graph; PhishTank and curated references.
Research gaps: Relies on heavy knowledge graph + LLM infrastructure, unsuitable for lightweight, client-side environments.

[8] Detecting Phishing Websites Using HTML, JavaScript and Image Semantics with Large Language Models (2024), Tanmay Ranjan, Yaman Kumar, Rajeev Shorey.
Methodology: Combines image and JavaScript code semantics processed via LLMs for phishing detection; emphasizes combining static and dynamic analysis techniques using prompt engineering.
Research gaps: Accurate but computationally expensive; lacks a hybrid ML+LLM approach for efficiency. No implementation in browser extensions for real-world deployment.

[9] Multimodal Large Language Models for Phishing Webpage Detection and Identification (2024), Jehyun Lee, Peiyuan Lim, Bryan Hooi, Dinil Mon Divakaran.
Methodology: Introduces a two-phase system using multimodal LLMs (GPT-4, Claude 3, Gemini Pro) to detect phishing by comparing webpage appearance/HTML to known brands; emphasizes explainability and robustness.
Research gaps: Missing integration with lightweight ML models for faster client-side detection. No mention of a real-time Chrome extension implementation.
Problem Statements
- Traditional phishing detection systems rely only on URL or metadata features, which makes them fast but prone to false negatives when facing sophisticated phishing sites that mimic real brands.
- LLM-based solutions provide deeper semantic and multimodal analysis but are computationally expensive, slow, and not suitable for real-time, browser-based protection.
- Most existing research focuses on offline datasets and benchmark testing; there is a lack of lightweight, deployable solutions that can operate directly inside a user's browser to block phishing sites in real time.
- Explainability is often missing from phishing detection systems, leaving users unaware of why a site was flagged, which reduces trust and adoption.
- There is currently no hybrid approach that combines the speed of ML with the semantic reasoning of LLMs in a Chrome extension for end-user protection.
Proposed System

Fig: Overview of the System. Pipeline: Data Collection → Feature Engineering (XGBoost) → Export ML Model (JSON); on page load, the web page is analysed using the ML model; when the verdict is uncertain, the HTML is sent for LLM Analysis; the resulting verdict triggers a client alert that either allows or blocks the page.
Proposed System
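The flow in the overview figure can be summarised in a short sketch. This is a minimal illustration only: the classify_locally helper is a hypothetical stand-in for the client-side check described in the Methodologies slides, and the backend address is assumed (only the /analyze route name comes from the implementation details later in the deck).

import requests

BACKEND_URL = "http://localhost:5000/analyze"  # assumed address of the Flask backend

def classify_locally(url: str) -> str:
    # Placeholder for the client-side ML / rule-based check (hypothetical helper);
    # a real implementation would score lexical and host features.
    return "unsure"

def check_page(url: str, html: str) -> dict:
    # Hybrid verdict: fast local check first, LLM fallback only when unsure.
    verdict = classify_locally(url)
    if verdict != "unsure":
        return {"verdict": verdict, "explanation": "decided by the local model"}
    # Escalate ambiguous cases to the server-side LLM analysis.
    resp = requests.post(BACKEND_URL, json={"url": url, "html": html}, timeout=20)
    return resp.json()  # expected shape: {"verdict": ..., "explanation": ...}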
Factors / Rules of Classification
1. Lexical-based: URL length, number of dots, presence of "@".
2. Host/Network-based: domain age, IP reputation, SSL certificate.
3. Content-based: page content, keywords, and semantic meaning.
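A minimal sketch of how the lexical and content rules above could be computed in Python. The keyword list and feature names are assumptions for illustration, and the host/network factors (domain age, IP reputation, SSL certificate) are left out because they require external lookups such as WHOIS or reputation APIs.

from urllib.parse import urlparse

SUSPICIOUS_KEYWORDS = ("login", "verify", "account", "secure", "update", "bank")  # assumed list

def lexical_features(url: str) -> dict:
    # Lexical rules: URL length, number of dots, presence of "@", subdomains, HTTPS flag.
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "has_at_symbol": int("@" in url),
        "num_subdomains": max(parsed.netloc.count(".") - 1, 0),
        "uses_https": int(parsed.scheme == "https"),
    }

def content_features(html: str) -> dict:
    # Content rules: crude keyword and form/iframe counts over the raw HTML.
    lower = html.lower()
    return {
        "keyword_hits": sum(lower.count(k) for k in SUSPICIOUS_KEYWORDS),
        "num_forms": lower.count("<form"),
        "num_iframes": lower.count("<iframe"),
    }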
Methodologies

Data Collection & Preprocessing
- Gather URL and webpage data from PhishTank, OpenPhish, and Alexa top sites; label each sample as "phishing" or "safe."
- Extract features: URL length, "@" symbol, HTTPS flag, form/iframe counts, subdomains, domain age, PageRank, etc.

Local ML Model (Client-Side)
- Feature extraction in content.js / utils.js.
- Rule-based classifier or lightweight XGBoost model (converted to JSON or JS logic).
- Three outcomes: phishing → block immediately; safe → allow page; unsure → send to backend.
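A sketch of the training-and-export step under the assumptions above: a small XGBoost classifier fitted on extracted feature vectors, serialised to model.json for the extension, and a helper that maps the predicted probability to the three outcomes. The placeholder data and the 0.3/0.7 thresholds are illustrative, not values from the slides.

import numpy as np
import xgboost as xgb

# Placeholder training data; real rows are feature vectors built from
# PhishTank / OpenPhish / Alexa pages, with y = 1 for phishing and 0 for safe.
X = np.random.rand(1000, 8)
y = np.random.randint(0, 2, size=1000)

model = xgb.XGBClassifier(n_estimators=50, max_depth=4, eval_metric="logloss")
model.fit(X, y)

# Export the trained trees for the extension; XGBoost serialises to JSON
# when the file name ends in .json.
model.get_booster().save_model("model.json")

def local_verdict(features: np.ndarray, low: float = 0.3, high: float = 0.7) -> str:
    # Map the phishing probability to the three client-side outcomes.
    p = float(model.predict_proba(features.reshape(1, -1))[0, 1])
    if p >= high:
        return "phishing"  # block immediately
    if p <= low:
        return "safe"      # allow page
    return "unsure"        # send HTML to the backend LLM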
Methodologies

Semantic LLM Analysis (Server-Side)
- Flask API receives the HTML when the ML model is "unsure."
- Prompt GPT-4 (or a smaller LLM) with: "You are a security analyst. Analyze this HTML and decide: phishing or safe? Justify. [HTML HERE]"
- Return: { verdict, explanation }

Decision Fusion & Explainability
- Block if either ML or LLM flags phishing.
- Show the natural-language explanation from the LLM in the popup.
- Log decisions for retraining and analytics.
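A sketch of the server-side LLM check and the decision-fusion rule, assuming the OpenAI Python SDK and a JSON-formatted reply. The model name gpt-4o-mini and the HTML truncation limit are assumptions (the slides mention GPT-4 or a smaller LLM).

import json
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

PROMPT = (
    "You are a security analyst. Analyze this HTML and decide: phishing or safe? "
    "Justify. Reply as JSON with keys 'verdict' and 'explanation'.\n\n{html}"
)

def llm_analyze(html: str) -> dict:
    # Server-side semantic check for pages the local model marked "unsure".
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": PROMPT.format(html=html[:20000])}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)  # {"verdict": ..., "explanation": ...}

def fuse(ml_verdict: str, llm_verdict: str) -> str:
    # Decision fusion: block if either the ML model or the LLM flags phishing.
    return "phishing" if "phishing" in (ml_verdict, llm_verdict) else "safe"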
Implementation Details

Chrome Extension (folder: extension/)
- manifest.json ← permissions & scripts
- content.js ← injects on every page, extracts features, runs the ML logic
- utils.js ← feature extraction & rule-based/XGBoost logic
- popup.html & popup.js ← UI showing "Safe" / "Blocked" + explanation
- model.json ← serialized XGBoost thresholds/trees

Flask Backend (folder: backend/)
- app.py ← Flask app with CORS, /analyze route
- phishing_llm.py ← LLM integration via the OpenAI API
- requirements.txt ← flask, flask-cors, openai
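A minimal sketch of backend/app.py matching the layout above: a Flask app with CORS and a single /analyze route that forwards the received HTML to the LLM helper, which is assumed to live in phishing_llm.py as in the folder listing.

# backend/app.py (sketch)
from flask import Flask, request, jsonify
from flask_cors import CORS

from phishing_llm import llm_analyze  # assumed helper module, per the folder listing above

app = Flask(__name__)
CORS(app)  # allow the extension's content script to call the API

@app.route("/analyze", methods=["POST"])
def analyze():
    data = request.get_json(force=True)
    html = data.get("html", "")
    if not html:
        return jsonify({"error": "missing html"}), 400
    result = llm_analyze(html)  # returns {"verdict": ..., "explanation": ...}
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000, debug=True)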
Expected Outcomes

A Chrome extension that:
- Detects phishing using a fast, local ML model.
- Falls back to a secure backend LLM analysis when needed.
- Blocks phishing attempts in real time.
- Offers natural-language explanations for its decisions.

Features:
- Real-time local detection (ML)
- Deep semantic analysis (LLM)
- Lightweight and privacy-respecting
- User-friendly alerts and explanations
- Highly extensible
Conclusion
This work presents a hybrid phishing website detection system that integrates Machine Learning (ML) and Large Language Models (LLMs) into a practical, real-time Chrome extension. The ML component (XGBoost / rule-based model) provides lightweight, fast, and local detection, ensuring that users are protected instantly without heavy computation. The LLM component, hosted on a backend server, offers semantic and contextual analysis of suspicious webpages, enabling detection of sophisticated phishing attempts that evade traditional feature-based methods. By combining these two approaches, the system achieves a balance of speed, accuracy, and explainability, overcoming the limitations of ML-only or LLM-only solutions. The Chrome extension ensures real-time user protection by automatically blocking phishing sites and providing clear explanations for flagged pages. The system is also scalable and adaptable, as the ML model can be retrained periodically with new phishing datasets, and the LLM prompts can be refined for evolving attack strategies.
References
[1] Li, W., Manickam, S., Chong, Y.-W., & Karuppayah, S. (2025). PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection. arXiv preprint arXiv:2506.15656. https://arxiv.org/abs/2506.15656
[2] Calzarossa, M. C., Giudici, P., & Zieni, R. (2025). An assessment framework for explainable AI with applications to cybersecurity. arXiv preprint arXiv:2502.12345. https://arxiv.org/abs/2502.12345
[3] Li, W., Manickam, S., Chong, Y.-W., & Karuppayah, S. (2025). PhishDebate: Multi-Agent LLM Framework. arXiv preprint arXiv:2506.15656. https://arxiv.org/abs/2506.15656
[4] Nair, R., et al. (2025). PhishEmailLLM: A Meta-Model for Phishing Email Detection Using LLMs. ACM Digital Library.
[5] Anonymous. (2025). Evolution of Phishing Detection with AI. arXiv preprint arXiv:2503.xxxxx.
References
[6] Eilertsen, G., et al. (2025). Enhancing Phishing Detection through Explainable AI. arXiv preprint arXiv:2504.xxxxx.
[7] Li, W., et al. (2024). KnowPhish: Multimodal LLM and Knowledge Graph for Phishing Detection. USENIX Security Symposium 2024.
[8] Ranjan, T., Kumar, Y., & Shorey, R. (2024). Detecting Phishing Websites Using HTML, JavaScript and Image Semantics with Large Language Models. arXiv preprint arXiv:2407.20361. https://arxiv.org/abs/2407.20361
[9] Lee, J., Lim, P., Hooi, B., & Divakaran, D. M. (2024). Multimodal Large Language Models for Phishing Webpage Detection and Identification. arXiv preprint arXiv:2408.05941. https://arxiv.org/abs/2408.05941