ML Workshop GMIS 2025 by Dr. Lynne Grewe, Ujas Goti, Nidhi Prajapati, and Naman Rajani. Supported by Google Developer Groups on campus, CSUEB. Using Google Cloud, Colab and Gemini credits provided by Google.
○n8n AI Workflow Creation Instructions w/ Video (40 minutes)
○Discussion (5 minutes)
○Github : https://github.com/Naman16rajani/ml-workshop-n8n
4)Further Investigation/Resources
○Backpropagation of loss in fully connected layers video
○Read about Optimizers - functions/algorithms to alter weights & learning rates
○n8n resources
Google Cloud Setup for ML training +Colab Setup
Google Cloud Setup
1) Google Cloud Account creation (pre-req) - do NOT enter a credit card. Must be done with an @gmail.com email address.
2)Google Cloud Credits
3)Google Cloud Project Setup,
3.1) Create new project
●name your project ml-workshop
●Select billing to the “Google Cloud Platform Trial Billing Account”
●You can see your project in the project-select widget
3.2) Enable APIs for your project
Google Cloud Console → APIs & Services → Library
1.Select your project at top (ml-workshop)
2.Open APIs & Services → Library.
In the search bar, enter each API & Click Enable for each one
○Vertex AI API
○Compute Engine API
○Cloud Storage API (aka “Cloud Storage JSON API”)
○(Optional) Artifact Registry API — only if you plan to run custom
containers now or later.
For example enabling Compute Engine API
Colab Setup
1) Make a copy in your Drive: File -> Save a copy in Drive (stored by default under Colab Notebooks)
2) Colab Secrets Setup
3) (INFO ONLY) Service Account & Roles - authentication takes place when you log in during the Colab run
In your Colab notebook:
1.Click the key (Secrets) icon in the left sidebar (or Manage → Secrets from the toolbar).
2.Create secrets with these names (keys) and values:
○ GOOGLE_CLOUD_PROJECT → your GCP Project ID ( ml-workshop-****)
○ GOOGLE_DEFAULT_REGION → your region ( us-central1)
○ GOOGLE_DEFAULT_BUCKET → your bucket name without gs:// (set it to
my-gemini-sft-bucket)
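Once the three secrets above are saved (and "Notebook access" is toggled on for each), a minimal sketch of reading them inside the notebook with Colab's userdata helper:

# Read the workshop secrets inside Colab; names must match the keys created above
from google.colab import userdata

PROJECT_ID = userdata.get("GOOGLE_CLOUD_PROJECT")    # e.g. ml-workshop-****
REGION = userdata.get("GOOGLE_DEFAULT_REGION")       # e.g. us-central1
BUCKET = userdata.get("GOOGLE_DEFAULT_BUCKET")       # e.g. my-gemini-sft-bucket (no gs:// prefix)

print(PROJECT_ID, REGION, BUCKET)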
How to find your Project ID (not the name): in the console, click the project name at the top → the ID is shown.
Use your Google Cloud login/password. NOTE: Owner is a basic, all-powerful role (includes create/read/write on Storage, starting Vertex AI jobs, etc.), so you don't need to add a separate service account or extra roles for this workshop.
2. Intuition-based understanding of DL-based Classification - Convolutional Neural Network
3. Foundational Models
4. n8n - AI workflows
Motivation - a sampling of applications
Image/Video Understanding
Transportation
Manufacturing
Medical/Health
Entertainment Sports
Surveillance and Security
Human-computer interfaces
Generative applications
Assistive technology
Commerce
Text/Data Understanding
Education
Generative
Service Industry
Assistive
Interface
Financial
Science / BioTech
Security/Fraud
Commerce
What is ML?
The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the fields of computer gaming and artificial intelligence.
From GeeksforGeeks:
AI = algorithms that exhibit intelligence through decision making.
ML = AI algorithms that allow a system to learn from data.
DL = ML algorithms that use deep (more than one layer) neural networks to analyze data and produce output accordingly.
ML History: Many Techniques to learn
•Statistical Modeling
•Structural Modeling
•Database Driven (similar)
•Expert Systems
•Training Based Learning
•Hypothesize and test
•Template Matching
•Interpretation Trees
•K-Nearest Neighbors
•Statistical Modeling
•Geometric Hashing, Indexing/ Hashing
•Neural Networks
•Eigenspaces
•Hidden Markov Models
•Support Vector Machines
•Deep Learning Networks
[Figure: example feature space - ear pointiness versus size of body]
A timeline of Deep Learning advances
...timelines do not mean that some techniques are no longer used
Foundational Multi-Modal Models
Intuition: ML (not Deep Learning) - Skin Cancer
•Explicit Feature Extraction --> Recognition Technique
K-Nearest Neighbors: vote among whoever is nearby
Statistical: where are the boundaries?
Expert Systems: High Texture + round boundary = Squamous cell
SVM : Support Vector Machines -Boundaries
MANY MORE......
[Feature-space plot axis: boundary shape]
Intuition: Deep Learning - Skin cancer
LEARN Features + Recognition - at the same time
[Figure: Deep Learning using a Convolutional Neural Network for classification - a Learning Features stage followed by a Recognition stage, outputting Malignant / Benign]
More Deep Learning
A start in Deep Learning --
Classification & the CNN
(Convolutional Neural Network)
SIMPLE CNN example -- Intuition
CNN = Convolutional Neural Network
•First = Convolutional Layers --> learn features to extract
•Next = Fully Connected Layers --> to reason over the features to
learn to classify the images
•Logistic Classifier: a NN whose output-layer nodes sum to 1.0 (i.e., 100% probability), so each output node represents the probability of the class it represents occurring.
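For intuition, a tiny sketch (toy scores, two output nodes for X and O) of how such an output layer turns raw scores into probabilities that sum to 1.0 via softmax:

import numpy as np

scores = np.array([2.0, 0.5])                     # made-up raw outputs for the X node and the O node
probs = np.exp(scores) / np.exp(scores).sum()     # softmax: exponentiate, then normalize
print(probs, probs.sum())                         # ~[0.82 0.18], sums to 1.0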
ML model to detect
Road Speed Signs
Classification Problem:
What Speed Sign is it?
CNN - Convolutional Layers
Intuition = like the neurons in our brain inside our visual
cortex --the first layers learn primitive features like edges
and corners.
•First = Convolutional Layers --> learn features to extract
•Capture local patterns --> good for imaging applications
CNN - Fully Connected Layers
Intuition = like the neurons in our brain that reason using extracted
features from the visual cortex
•Fully Connected Layers --> to reason over the features to learn to
classify the images (what speed sign?)
Intuition -- X/O classification
SIMPLE example - want to detect X or O
What we need - training data
•Need LOTS, LOTS of data
How much training data? For some forms of object recognition it can run to 100,000 samples or more; sometimes around 10,000; for the really, really simple example shown here we might get away with 200.
Google Open Images database - millions of URLs to labeled images
Convolutional Layer- MxM filters
•Example apply Three 5x5 filters to a 32x32 image (1,024 values)
•Output = Three 28x28 data values (2,352 values)
NOTE: we lose 2 rows at the top & bottom and 2 columns at the left and right borders, since the 5x5 mask cannot be centered there --> so the output goes from 32x32 to 28x28
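A minimal sketch of this exact shape change, assuming a grayscale 32x32 input and TensorFlow/Keras (one of the frameworks mentioned later in these slides):

import tensorflow as tf

x = tf.random.normal((1, 32, 32, 1))              # one 32x32 grayscale image = 1,024 values
conv = tf.keras.layers.Conv2D(filters=3, kernel_size=5, padding="valid")   # three 5x5 filters, no border padding
y = conv(x)
print(y.shape)                                    # (1, 28, 28, 3) -> 28*28*3 = 2,352 values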
Not done yet --- a CNN needs NON-LINEARITY:
we must introduce a non-linear activation function, and for a CNN we do so at each data value by applying Rectified Linear Units (ReLU): set values < 0 to 0
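A one-line illustration of ReLU with NumPy (toy values):

import numpy as np

feature_map = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
relu = np.maximum(feature_map, 0.0)               # ReLU: negatives become 0, positives pass through unchanged
print(relu)                                       # [0.  0.  0.  1.2 4. ]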
This seems like magic - how did they come up with this? Why this?
Answer: Neural Networks (which came before Deep Learning) first developed the concept, and more elaborate functions like sigmoids are also used
SOME Hand Waving -- to be able to solve more
complex problems we must insert a nonlinear
function in our network.
Convolutional Layer order example
= Convolution with set of MxM filters + nonlinear + max pooling (downsample)
Example with multiple convolutional layers
You select the number of layers
TIP: look at what others have done to solve similar problems
NEXT: Fully Connected Layers --> Need a
Decision
•Need fully connected Layers
•Last layer = decision layer
•You must decide how many fully connected layers and their dimensions (output = #classes)
Weights
will be learned
through training
Fully Connected Layer
Weights applied to incoming values, summed and then an activation
(non linear) function applied to get a new output
What happens at each new Output Value generated
what happens at last layer
•Get a vote for X and for O ---> You decide if there is a clear winner
Final structure to our example CNN
•5 “layers” – 3 convolutional and 2 fully connected
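A hedged Keras sketch of this 5-"layer" structure (the filter counts and sizes here are illustrative assumptions, not the exact numbers from the slides): three convolutional blocks followed by two fully connected layers, ending in a 2-node softmax for X vs. O:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(8, 5, activation="relu"),   layers.MaxPooling2D(2),   # convolutional layer 1
    layers.Conv2D(16, 3, activation="relu"),  layers.MaxPooling2D(2),   # convolutional layer 2
    layers.Conv2D(32, 3, activation="relu"),                            # convolutional layer 3
    layers.Flatten(),
    layers.Dense(32, activation="relu"),                                # fully connected layer 1
    layers.Dense(2, activation="softmax"),                              # fully connected decision layer: X vs O
])
model.summary()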
Network Architecture - questions
•How many layers?
•How large each layer?
•Activation Functions?
•Structure of Network?
>>> Use existing architectures to solve similar problems
>>> Guidelines (take an ML course)
>>> Search for best (an optimization problem)
TIP:
There are many famous Network
Architectures you can use and are even
programmed for you in frameworks like
Tensorflow and Pytorch.
Some are even trained for you --so you can
retrain it with your data (often gives better
performance AND reduces training time)
Network Architecture - and Retraining
●Famous Architectures
●Frameworks (TensorFlow, PyTorch, etc.) implement them
●Pretrained -- so you can retrain with your data (can give better performance & reduce training time)
Training --> the goal
Learn the weights of the network (best convolutional filters, best
combinations in fully connected layers) so that we minimize the
ERROR/COST (loss)
Error Function or Cost Function = measures the difference between the
output vector calculated and what its value should be (ground truth)
TRAINING: change network weights A LITTLE, PERIODICALLY,
so that the error (cost function) is reduced SOME
BACKPROPAGATE error (backwards)
through the layers, updating weights @ each layer
Layer X
Gif Source from 3Blue1Brown, Chapter 3, Deep Learning.
TRUTH: 1.0, 0.0
cost/error = (1.0 - 0.92)^2 + (0.0 - 0.51)^2
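Working that out as a sum of squared differences (a sketch of the idea; the two outputs 0.92 and 0.51 come from the figure):

output = [0.92, 0.51]
truth = [1.0, 0.0]
cost = sum((t - o) ** 2 for t, o in zip(truth, output))
print(cost)    # (0.08)^2 + (-0.51)^2 = 0.0064 + 0.2601 = 0.2665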
Training and what happens to the
weights of the network --and how?
Epoch = presenting all of the training data once through the network
Batch Size = number of training samples that run through the network BEFORE updating the network weights.
•estimate the error gradient (cost function) to update the
model weights a little in the right direction
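In Keras, both knobs appear directly in model.fit; a sketch assuming the model from the earlier CNN example and placeholder arrays x_train / y_train:

# x_train / y_train are placeholders for your prepared training data
# (assumes the model has already been compiled - see the learning-rate sketch below)
history = model.fit(
    x_train, y_train,
    epochs=10,           # each epoch = one full pass over the training data
    batch_size=32,       # weights are updated after every 32 samples
    validation_split=0.1,
)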
HOW much do we alter the network weights
during training? --> LEARNING RATE
How Much do you alter the weights of the network to correct for
errors at each step in training?
•Determined by the LEARNING RATE
•typically higher in the beginning
•slows down later as we get more confidence in our network weights
•there are many algorithms for this.
>>>CAREFUL: if your learning rate is too high
you can “overtrain” to your training data (overfit)
>>> NOTE: if the learning rate is too low you
will take a long time to learn the weights.
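A sketch of setting the learning rate on an optimizer, including a schedule that starts higher and decays during training (Adam and ExponentialDecay are common choices assumed here, not prescribed by the slides):

import tensorflow as tf

# start at 1e-3 and shrink the learning rate as training progresses
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])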
Training & Parameter Tuning
Parameter Tuning - change the values of parameters and retrain the network to see if you get better results. These parameters include:
•Batch Size
•Number Epochs
•Activation Function selection
•Learning Rates
MORE......
>>> you can change these to get different results. (empirically, grid
search, optimization search problem)
>>> use common values for size of data and network.
>>> experience will guide you
EVALUATION: How Good is Your ML Model
(testing) - Metrics
•Accuracy = % of data that is correctly classified (100% is
perfect)
•Loss = number indicating how bad the model's prediction
was on the test examples. 0 means perfect
OTHERS
•Precision (and mean Average
Precision)
•Recall (and mean Recall)
•Confusion Matrix
•more....
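A sketch of computing several of these with scikit-learn (y_test and y_pred are placeholders for ground-truth and predicted class labels):

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))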
Data Split --> Train, Validate, Test
•Typically split 80/20 (train+valid / test) or 80/10/10 (train/valid/test) or something similar
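A sketch of an 80/10/10 split using two calls to scikit-learn's train_test_split (X and y are placeholders for your data and labels):

from sklearn.model_selection import train_test_split

# first hold out 20%, then split that 20% in half for validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# result: 80% train, 10% validation, 10% test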
Transformers --> leading to
Foundational Models
Foundational Multi-Modal Models
Vision Transformers - did you know Google wanted to call it an "Attention Network" rather than Transformer early on?
→ "Attention Is All You Need", Google [2017]
→ The Transformer Architecture (NLP focus)
Attention --> Vision Transformers
Where to focus in the image
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" [2016]
Transformers are neural networks consisting of an encoder-decoder architecture with self-attention mechanisms.
Attention TOWARDS Vision Transformers
The Transformer
Multi-Head Attention - can process each part of the sequence in parallel (helps make it more efficient to train)
Transformers are neural networks consisting of an encoder-decoder architecture with self-attention mechanisms.
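The self-attention computation at the heart of a Transformer, sketched in NumPy with toy shapes: softmax(Q K^T / sqrt(d)) V.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                          # how much each token attends to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                                     # weighted mix of the values

Q = np.random.rand(4, 8)   # 4 tokens, dimension 8 (toy example)
K = np.random.rand(4, 8)
V = np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)                         # (4, 8)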
How Images Enter a Vision Transformer:
“From Pixels to Tokens: Patch Embedding”
•First layer splits the image into p x p patches using a learned convolution with p x p kernels, 3 -> D channels, stride = p; OR a CNN can be applied to the original image to get features.
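A Keras sketch of that patch-embedding step, assuming a 224x224 RGB input, patch size p = 16, and embedding dimension D = 768 (typical ViT-Base values, not dictated by the slide): a p x p convolution with stride p turns the image into a sequence of patch tokens.

import tensorflow as tf
from tensorflow.keras import layers

p, D = 16, 768
images = tf.random.normal((1, 224, 224, 3))                        # batch of one RGB image
patch_embed = layers.Conv2D(filters=D, kernel_size=p, strides=p)   # learned pxp convolution, 3 -> D channels
tokens = patch_embed(images)                                       # (1, 14, 14, 768)
tokens = tf.reshape(tokens, (1, -1, D))                            # flatten to (1, 196, 768): 196 patch tokens
print(tokens.shape)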
VLM – Vision Language Models
& Foundational Models
Multi-modal (vision+text): Transforming AI
•From healthcare to manufacturing, finance to entertainment, large vision models have become (or will become) indispensable assets
Distinctive Features of Large (Foundational)
Models: Parameters and Scale
•LARGE # Parameters: The Driving Force Behind Large Vision Models
•enabling them to capture intricate patterns and nuances within visual data.
•Deep Architectures: deep network architectures, comprising multiple layers of
interconnected nodes.
•Transfer Learning Capabilities: excel in transfer learning, where a pre-trained model on a
massive dataset can be fine-tuned for specific tasks with relatively smaller datasets.
•This adaptability makes them versatile across applications, from medical imaging to industrial quality control.
•Massive Datasets: Large vision models thrive on extensive datasets encompassing vast
arrays of visual information
•Computational Intensity: The training process is computationally demanding, often
requiring GPUs or TPUs.
•Real-Time Inference Challenges: While training benefits from abundant computational
resources, deploying models on edge devices necessitates optimizing for real-time
inference in resource-constrained environments.
Visual Question Answering (VQA):
Example — 'What is the man jumping over?' → 'Fire hydrant'.
Visual Grounding:
Example — 'Yellow fire hydrant' → associated region in the image.
Image-Text Matching:
Example — 'A cat is lying on a bed' → output True/False.
Key Notes:
•- VLMs perform diverse tasks
beyond traditional recognition.
•- Additional intelligence
emerges when integrating large
language models with vision
systems.
Foundational models have Vision Encoders
Visualization of vision encoder outputs using attention maps:
•Example image: a person in a red dress walking in an urban environment.
Heatmaps illustrate focus areas for different queries:
•- 'Look at that person': Strong attention on the person in red.
•- 'What is the name of that hotel?': Attention shifts to the building in the background.
•- 'What is that?': Broader attention across multiple regions, balancing object
recognition with scene context.
Example Foundational Models
•Gemini: A family of multimodal models by Google DeepMind released in late 2023. The
top-tier Ultra variant reaches around 540 billion parameters, with Pro around 280 billion.
•Meta’s LLaMA 4: Announced in April 2025, with Scout and Maverick versions each at
~17B parameters. A super-sized Behemoth model (288B) is in development.
•GPT-4V: The vision-enabled extension to GPT-4 launched around September 2023.
Parameter counts remain undisclosed by OpenAI.
•Qwen-VL Series: Alibaba's multimodal models—first Qwen-VL in 2023, upgraded to
Qwen2.5-VL in early 2025. These come in multiple sizes including the largest 72B-parameter
variant.
A foundation model is a large-scale model trained on broad, diverse datasets (text, images, speech, structured data, etc.) that can be adapted to many downstream tasks.
Variants + Dataset size for VLMs
●Gemini (Google DeepMind) - launched Dec 2023. Smallest variant: Gemini Nano (~1.8B est.); largest variant: Gemini Ultra (~540B). Dataset: multimodal (text+images+code+audio); trillions of tokens + billions of images (not fully disclosed by DeepMind).
●LLaMA 4 (Meta) - launched Apr 2025 (open source). Smallest variants: LLaMA-4 Scout/Maverick (17B); largest variant: LLaMA-4 Behemoth (288B). Dataset: trained on ~15T tokens of text + image-text pairs.
●GPT-4V (OpenAI) - launched Sept 2023. Variant sizes not disclosed - generally believed to be in the range of hundreds of billions of parameters (speculation based on the GPT-4 family). Dataset: mixture of web-scale text + multimodal datasets; exact scale undisclosed.
●Qwen-VL (Alibaba) - 2023 → 2025 (open source). Smallest variant: Qwen2.5-VL-3B; largest variant: Qwen2.5-VL-72B. Dataset: trillions of tokens + hundreds of millions of image-text pairs.
CNNs vs. Vision Transformers (ViTs) vs. Foundation Models
●Efficiency / Latency / Power: CNNs - very efficient, good for edge; ViTs - moderate efficiency; Foundation Models - least efficient, heavy compute.
●Accuracy (typically): CNNs - high on small/medium data; ViTs - higher with large data; Foundation Models - highest, state-of-the-art.
●Dataset Size: CNNs - work well with small/medium data; ViTs - need large datasets; Foundation Models - trained on web-scale data.
●Cost for Development: CNNs - low; ViTs - medium-high; Foundation Models - very high to train (API use easier).
●Tasks: CNNs - real-time, embedded, small-data vision; ViTs - classification, detection, segmentation, captioning; Foundation Models - few/zero-shot, multimodal, captioning, generative.
Practice - w/ Foundational Model fine-tuning --> AGENDA
PARTNER UP -- meet your neighbor
n8n AI Workflow with a Custom Multi-Modal (image + text) Gemini Model plus Base-Model Spanish Translation
Naman Rajani, CS graduate student, iLab, CSUEB
AI workflows and n8n
In n8n you can chain together nodes such as:
●Input/Trigger nodes: A new email arrives, a form is submitted, or a file is uploaded.
●Processing nodes: Send the text to an LLM (OpenAI, Hugging Face, etc.) for
summarization, classification, or generation.
●Logic/Control nodes: Apply filters, conditions, or loops to handle different scenarios.
●Output/Action nodes: Save results to a database, post to Slack, send an email, or
update a CRM.
AI Workflow: smart processing
of new customer support ticket
Trigger: New customer support ticket is
created.
AI Step: Send ticket text to GPT for sentiment
analysis and auto-tagging.
Decision Logic: If sentiment is “angry,”
escalate to a human immediately; otherwise,
generate a draft response.
Action: Post the result in Slack and log the
response in a database
*** example***
n8n has AI templates (free)
Top 3799 AI automation workflows
Template: Basic Automatic Gmail Email Labelling with OpenAI & Gmail API
Multi-modal (vision + text) Chat input Processed by
fine-tuned Gemini Model & Output Translated to Spanish
Trigger: User enters an image and a question (VQA task)
AI Step 1: Send the image + question to the fine-tuned Gemini model
Decision Logic: Understand this question + data with our custom model & provide the answer in English and Spanish
Action (AI Step 2): Translate the answer to Spanish
*** example YOU WILL DO ***
YOU TRY IT
Instructions
Video Tutorial
Connect With Us:
Next?
> VLA – VISION LANGUAGE
ACTION MODELS
Lynne Grewe
Naman Rajani Nidhi Prajapati Ujas Goti
“n8n + Vertex AI Gemini: Using a fine-tuned multi-modal model (endpoint) and
Gemini API in one workflow”
Overview
You’ll build an n8n workflow that:
● Accepts chat input (optionally with an image)
● Sends (image + prompt) to your fine-tuned Gemini model deployed on a Vertex AI Endpoint
● Uses an AI Agent for Spanish translation through the Google Gemini (AI Studio) API, via the n8n "Google Gemini (PaLM) API" node
Prerequisites
● Node.js installed locally
● n8n installed and running locally (default: http://localhost:5678)
● Google Cloud project
● Fine-tuned Gemini model deployed to a Vertex AI Endpoint
● Google AI Studio API key (for the Gemini API node)
Keep these handy (you’ll paste them later):
● PROJECT_ID
● LOCATION (region)
● ENDPOINT_ID
● OAuth Client ID and Client Secret (from Google Cloud Console)
● Google AI Studio API Key
VIDEO LINK
N8n AI workflow purpose & demo [start-2:00]
Part A — Vertex AI & OAuth Setup
1) Create/confirm the Vertex AI Endpoint [2:00- ]
1.Go to Google Cloud Console → Vertex AI → Endpoints.
2.Click Create.
●Name: your choice
●Region: same as your model
●Model: select your fine-tuned Gemini model
3.After creation, open the endpoint and note:
●Endpoint ID (save as ENDPOINT_ID)
●Region (save as LOCATION)
4.Also note your Project ID (save as PROJECT_ID).
2) Create OAuth 2.0 client (for n8n to call the endpoint) [2:00-3:50 ]
1.In Console, go to APIs & Services → Credentials
IF you see “Remember to Configure the OAuth Consent Screen...” then FOLLOW THE
INSTRUCTIONS
2.Click Create Credentials → OAuth client ID. (for web application)
3.Configure:
oName: e.g., n8n VertexAI OAuth
oAuthorized JavaScript origins:
http://localhost:5678
oAuthorized redirect URIs:
http://localhost:5678/rest/oauth2-credential/callback
4.Create it and download the JSON (you’ll use Client ID and Client Secret in n8n).
Part B — n8n Credentials
1) Google OAuth2 API (for calling your custom endpoint) [3:50-5:57 ]
1.In n8n, click the down arrow next to Create Workflow (top-right) → Credentials → New.
2.Search Google OAuth2 API and select it.
3.From the downloaded JSON:
Client ID: <your_client_id>
Client Secret: <your_client_secret>
4.Scope: https://www.googleapis.com/auth/cloud-platform
5.Save, then click Connect and complete Google sign-in.
2) Google Gemini (PaLM) API (for calling Gemini via AI Studio key) [3:50-5:57 ]
1.Create an API key here: https://aistudio.google.com/apikey
2.In n8n, create another credential → search Google Gemini (PaLM) API.
3.API Key: paste your Google AI Studio key.
4.Save.
Part C — Build the n8n Workflow [5:57- ]
1) Chat Input node [5:57-6:30 ]
● Add Chat Input.
● Enable Allow file uploads (if you want to send an image to your fine-tuned model).
2) Extract from File (Binary → Base64) [6:30-7:25 ]
● Add Extract from File node.
● Input Binary Field: data0
● Destination Output Field: data
● Options:
File Encoding: base64
Keep Source: both
This converts the uploaded image to Base64 in $json.data and keeps the original binary + metadata in
$json.files[0].
3) HTTP Request → Call your Vertex AI Endpoint (custom model) [7:25-13:44 ]
●Add HTTP Request node (name it e.g., Call Fine-Tuned Gemini).
●Method: POST
●URL:
https://aiplatform.googleapis.com/v1/projects/<PROJECT_ID>/locations/<LOCATION>/endpoints/<ENDPOINT_ID>:generateContent
You can hardcode the IDs, or pass them in from Chat Input using fields like $json.projectId, etc.
●Authentication: Predefined Credential Type
●Credential Type: Google OAuth2 API
●Google OAuth2 API: choose the credential you created earlier
●Send Headers: true
Header: Content-Type: application/json
● Send Body: true
Body Content Type: JSON
JSON Body (copy-paste as is; n8n’s expressions are supported inside {{ ... }}):
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "{{ $json.chatInput }} what is this image about"
        },
        {
          "inlineData": {
            "mimeType": "{{ $json.files[0].mimeType }}",
            "data": "{{ $json.data }}"
          }
        }
      ]
    }
  ]
}
If you don’t upload an image, remove the inlineData part or guard it with conditions.
4) AI Agent node (post-processing / orchestration)[13:44-14:48 ]
●Add AI Agent node.
●Prompt:
{{ $json.candidates[0].content.parts[0].text }}
(This pulls the text result from the previous node—adjust the path if your HTTP response structure differs.)
●Options → System Message: e.g.,
Translate this response to Spanish.
(or any instruction you want)
5) Google Gemini Chat Model (base model not custom) (AI Studio API) [14:48-16:05 ]
●Add Google Gemini Chat Model node.
●Credentials: select your Google Gemini (PaLM) API credential.
●Model: gemini-2.5-flash-lite (or pick another).
●Prompt: you can feed the output from the HTTP node, or your own text.
6) Wire it up [all-done as creating ]
●Chat Input → Extract from File
●Extract from File → HTTP Request (Endpoint)
●HTTP Request → Google Gemini Chat Model (optional chain, or use either/or)
●Google Gemini Chat Model → AI Agent (or HTTP Request → AI Agent if skipping the Gemini
API step)
7) Run workflow [16:05-16:49 ]
●Use Chat Input to pose question & upload image, hit enter & see results
Tips & Notes
●No image?
Remove the inlineData section or add a conditional so the HTTP body only includes inlineData when
a file exists.
●Secrets
Don’t commit your Client Secret or API Key to version control.
●Testing
Start with a plain text prompt (no image) to confirm your endpoint and OAuth are working, then
add image handling.
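For that sanity check, a hedged Python sketch for calling the endpoint outside n8n using Application Default Credentials (assumes you have run gcloud auth application-default login; substitute your own PROJECT_ID / LOCATION / ENDPOINT_ID; the URL mirrors the one used in the HTTP Request node):

import requests
import google.auth
from google.auth.transport.requests import Request

PROJECT_ID, LOCATION, ENDPOINT_ID = "<PROJECT_ID>", "<LOCATION>", "<ENDPOINT_ID>"  # fill in your values

creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(Request())   # obtain an access token

url = (f"https://aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
       f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:generateContent")
body = {"contents": [{"role": "user",
                      "parts": [{"text": "Hello - describe what you were fine-tuned to do."}]}]}

resp = requests.post(url,
                     headers={"Authorization": f"Bearer {creds.token}",
                              "Content-Type": "application/json"},
                     json=body)
print(resp.status_code)
print(resp.json())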
Quick Checklist
●Vertex AI Endpoint exists and is bound to your fine-tuned model
●Saved PROJECT_ID, LOCATION, ENDPOINT_ID
●OAuth2 credential in n8n (scope: Cloud Platform)
●Google AI Studio API key credential in n8n
●n8n workflow: Chat Input → Extract from File → HTTP Request → AI Agent → Gemini Chat
Model
Reference :
https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/generateContent