OpenChain Webinar- The Role of Data in the Supply Chain of AI - 2024-10-10
26 slides
Oct 11, 2024
The Role of Data in the
Supply Chain of AI
10 October 2024
Nick Schifano
CEO FastCatalog.ai [email protected]
Commercializing an AI powered product or service
Source Build Deploy Manage
Build stage - simplified view (diagram)
• Inputs: data (public data, private data) and prior AI models (open-source AI model, private AI model)
• Training framework (training, fine-tuning), running on compute
• Output: new AI model, used by the AI app
Technology infrastructure components: AI training frameworks; compute; data and AI model hosting providers
AI "raw materials": data; AI models
Source Build Deploy Manage
Popular public datasets - examples

Common Crawl
•Text
•https://commoncrawl.org/
•2.8 billion web pages (410 TiB of uncompressed content)
•Limited license (not OSS)
•Browsewrap
•Publicly available

RedPajama
•Text
•https://github.com/togethercomputer/RedPajama-Data
•https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
•30 trillion tokens
•Apache-2.0
•Browsewrap
•Publicly available

BookCorpus
•Text
•https://huggingface.co/datasets/bookcorpus/bookcorpus
•7,000 books scraped from the indie ebook distribution website Smashwords
•Browsewrap
•Publicly available

LAION-400-MILLION
•Images
•https://laion.ai/blog/laion-400-open-dataset/
•400 million images
•CC-BY-4.0 for the metadata dataset
•Browsewrap
•Publicly available
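The dataset examples above all carry the same handful of fields (modality, URL, license, agreement type, availability), which suggests a simple catalog schema. A minimal sketch in Python; the `DatasetRecord` type and its field names are illustrative assumptions, not from any standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry for a public dataset (fields taken from the examples above)."""
    name: str
    modality: str            # e.g. "text", "images"
    url: str
    license: str             # SPDX identifier where one exists, else a short label
    agreement: str           # e.g. "browsewrap", "clickthrough"
    publicly_available: bool

CATALOG = [
    DatasetRecord("Common Crawl", "text", "https://commoncrawl.org/",
                  "Limited license (not OSS)", "browsewrap", True),
    DatasetRecord("RedPajama-Data-V2", "text",
                  "https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2",
                  "Apache-2.0", "browsewrap", True),
]

def by_license(catalog, license_id):
    """Filter catalog entries by (SPDX) license identifier."""
    return [d for d in catalog if d.license == license_id]
```

Using SPDX license identifiers where they exist (e.g. `Apache-2.0`) keeps entries machine-filterable; non-OSS terms need a free-text label.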
Popular "open source" models - examples

Llama 3.1
•Text-to-text
•https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
•Training data: pretrained on ~15 trillion tokens (100 tokens ≈ 75 words) of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.
•Hundreds of variants (fine-tuned, quantized, etc.) based on Llama 3.1 on Hugging Face
•Clickthrough
•LLAMA 3.1 Community License Agreement:
"If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service (including another AI model) that contains any of them, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display 'Built with Llama' on a related website, user interface, blogpost, about page, or product documentation. If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include 'Llama' at the beginning of any such AI model name."
Black Forest Labs FLUX.1-dev
•Text-to-image
•https://huggingface.co/black-forest-labs/FLUX.1-dev
•No information on training data
•5,000+ variants based on FLUX on Hugging Face
•Clickthrough
•FLUX.1 [dev] Non-Commercial License:
"You may only access, use, Distribute, or create Derivatives of the FLUX.1 [dev] Model or Derivatives for Non-Commercial Purposes"
Source Build Deploy Manage
AI deployment models & roles
•Deployment models: Cloud, OnPrem, Device, AI2H2H
•Roles: AI Model Provider, AI Model Deployer
Regulations and guidelines about AI data transparency
•CA - AB 2013 - Generative artificial intelligence: training data transparency
https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB2013
•USA Federal
•NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0) - https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
•AI Executive Order - https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
•EU AI Act
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
•Japan MITI AI Guidelines for Business Version 1.0
https://www.soumu.go.jp/main_content/000943087.pdf
•G7 Hiroshima Principles
https://www.mofa.go.jp/files/100573473.pdf
AB 2013
Training Data Documentation Disclosure:
•Post documentation before making generative AI publicly available (Section 3111)
•Exemptions: systems solely for security, national security, or aircraft (Section 3111(b)(1-3))
Source Datasets Details:
•Dataset name, provider, modality, presence of synthetic data (Section 3111(a)(12))
•Personal info: disclose if personal data is included (Section 3111(a)(7))
•IP, collection dates: disclose copyright status (Section 3111(a)(5))
Data Processing Purpose:
•Must disclose cleaning or modification efforts (Section 3111(a)(9))
•Description of how datasets further the AI system's purpose (Section 3111(a)(2))
Refer to legislation for exact requirement
NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)
Training Data Documentation Disclosure:
•3.4 "Trustworthy AI depends upon accountability. Accountability presupposes transparency. […] Meaningful transparency provides access to appropriate levels of information based on the stage of the AI lifecycle."
Source Datasets Details:
•3.4 "[…] Maintaining the provenance of training data and supporting attribution of the AI system's decisions to subsets of training data can assist with both transparency and accountability. Training data may also be subject to copyright and should follow applicable intellectual property rights laws."
EU AI Act
Training Data Documentation Disclosure:
•Obligation: Providers of general-purpose AI models must maintain and keep up-to-date technical documentation, including the training and testing process and results of evaluation. This documentation must be provided to the AI Office and national competent authorities upon request. (Section 1(a))
•Exemptions: Providers of open-source AI models released under a free license, with publicly available parameters, are exempt from this obligation unless the model poses systemic risks. (Section 2)
Source Datasets Details:
•Providers must draw up and make publicly available a detailed summary of the content used for training the AI model. This includes dataset sources, modality, and any synthetic data used. (Section 1(d))
•Providers must also comply with copyright law, ensuring that any reserved rights in datasets are respected. (Section 1(c))
Data Processing Purpose:
•The documentation must include information on any data cleaning, anonymization, or modifications made during the AI model's development. Providers must comply with state-of-the-art technologies for identifying and protecting intellectual property rights, particularly under Directive (EU) 2019/790. (Section 1(c))
•The documentation must provide a clear understanding of the purpose of the training data in relation to the general-purpose AI model's capabilities and limitations. Providers must ensure that the information enables other AI system providers to understand how the AI model can be integrated and used responsibly. (Section 1(b)(i))
EU AI Act Annex XI
Training Data Documentation Disclosure:
•Providers must include:
- A general description of the AI model's tasks, intended use cases, and the AI systems in which it can be integrated.
- Acceptable use policies for the AI model.
- Date of release and methods of distribution.
- The architecture, including the number of parameters, and the modality (e.g., text, image) and format of inputs/outputs.
- Applicable license information. (Section 1(1))
Source Datasets Details:
•Providers must disclose information on training, testing, and validation data, including:
- Type of data used (e.g., text, images).
- Provenance of the data, how it was obtained, and curation methodologies (e.g., cleaning, filtering).
- Number of data points, scope, and characteristics of the dataset.
- Methods to detect biases and ensure suitability of data sources. (Section 1(2)(c))
Data Processing Purpose:
•Providers must describe:
- The data curation processes applied during model development, including cleaning, filtering, or other methods used to process the data.
- Measures in place to anonymize data if applicable. (Section 1(2)(c))
•Providers must explain:
- The key design specifications and choices for the model, including the rationale and assumptions made during the design.
- What the model is optimized for, including the relevance of different parameters and design decisions. (Section 1(2)(b))
MITI AI Guidelines for Business Version 1.0
Training Data Documentation Disclosure:
•D-7) ii. "In order to improve traceability and transparency, prepare documents on your AI system development processes, data collection and labeling affecting decision-makings, algorithms you have used, and the like, as far as possible in a form that third parties can use to validate the documents ('7) Accountability'). (Note) This does not require to disclose all the documents prepared."
Source Datasets Details:
•C-7) Accountability - Improving traceability: "Establish a situation that allows the origin of data and decisions made during the development, provision, or use of the AI system or service to be traced forward and backward to the extent that is reasonable and technically possible."
G7 Hiroshima Principles
1. Take appropriate measures throughout the development of advanced AI systems, including prior to and throughout their deployment and placement on the market, to identify, evaluate, and mitigate risks across the AI lifecycle.
•Enable traceability, in relation to datasets, processes, and decisions made during system development. These measures should be documented and supported by regularly updated technical documentation.
3. Publicly report advanced AI systems' capabilities, limitations and domains of appropriate and inappropriate use, to support ensuring sufficient transparency, thereby contributing to increased accountability.
•This should include publishing transparency reports containing meaningful information for all new significant releases of advanced AI systems.
•These reports, instructions for use and relevant technical documentation, as appropriate, should be kept up-to-date.
Regulations and guidelines about AI data transparency - summary

USA CA AB 2013 (AI training data transparency)
•Disclosure requirement: post documentation before making generative AI publicly available (Section 3111)
•Data origin disclosure: dataset name, provider, modality, presence of synthetic data, personal info, IP, collection dates, etc.
•Data processing disclosure: cleaning or modification efforts, etc.
•Applicability: "Developer" (probably both roles)

USA NIST AI RMF 1.0
•Disclosure requirement: "Accountability presupposes transparency"
•Data origin disclosure: provenance of training data, etc.
•Applicability: Providers

EU AI Act
•Disclosure requirement: documentation [about the AI model] must be provided to the AI Office and national competent authorities upon request
•Data origin disclosure: providers must draw up and make publicly available a detailed summary of the content used for training the AI model, including dataset sources, modality, and any synthetic data used (Section 1(d))
•Data processing disclosure: providers must describe data curation processes applied during model development (including cleaning, filtering) and measures in place to anonymize data if applicable, address bias, etc.
•Applicability: Providers, certain Deployers

Japan MITI AI Guidelines
•Disclosure requirement: "Establish a situation that allows the origin of data to be traced […]"
•Data origin disclosure: prepare documents on your AI system development processes, data collection and labeling, etc.
•Applicability: Providers?

G7 Hiroshima Principles
•Disclosure requirement: "publishing transparency reports containing meaningful information"
•Data origin disclosure: enable traceability in relation to datasets, processes, and decisions made during system development
•Applicability: Providers?
Source Build Deploy Manage
Data supply chain scenarios

Scenario: use only compliant datasets or AI models in your AI systems
Key enabling capabilities:
•Profiles for acceptable dataset characteristics vs. intended use
•Catalog of datasets or AI models
•AI compliance review process

Scenario: produce a transparency report for your AI-powered product or service
Key enabling capabilities:
•Catalog of datasets or AI models
•Inventory of datasets or AI models used in your products/services
•Standard for transparency reports

Scenario: address an adverse event in the supply chain
Key enabling capabilities:
•Catalog of datasets or AI models
•Inventory of datasets or AI models used in your products/services
•Adverse event reporting template or standard
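The transparency-report scenario above amounts to a join between two of the listed capabilities: a catalog of datasets/models and a per-product inventory. A minimal sketch; all names, the catalog fields, and the report shape are illustrative assumptions, not from any standard:

```python
# Hypothetical catalog: dataset -> disclosure fields regulators ask about.
CATALOG = {
    "redpajama-v2": {"license": "Apache-2.0", "modality": "text",
                     "synthetic_data": False},
    "laion-400m":   {"license": "CC-BY-4.0 (metadata)", "modality": "images",
                     "synthetic_data": False},
}

# Hypothetical inventory: product/service -> datasets or models it uses.
INVENTORY = {
    "my-ai-app": ["redpajama-v2", "laion-400m"],
}

def transparency_report(product):
    """Join inventory and catalog: one disclosure entry per dataset used."""
    return [{"dataset": d, **CATALOG[d]} for d in INVENTORY[product]]
```

Keeping the catalog and inventory separate is what makes the third scenario (adverse events) tractable: a problem reported against one dataset can be traced to every product in the inventory that uses it.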
Use only compliant datasets or AI models in your AI systems
https://spdx.github.io/spdx-spec/v3.0.1/model/AI/AI/

Policy
•Define the characteristics that need to be traced through the ingestion, development, and deployment of an AI model or system (dataset name, provider, license, presence of personal data, etc.)
•Define profiles of dataset characteristics (open-source license, non-commercial use restriction, anonymization, etc.) based on intended use (internal use, commercial use - generative AI, commercial use - end-user interaction, etc.)
•Set the required level of authority for approvals and deviations

Process
•Define your risk management posture: speed vs. 'safety'
•Balanced posture: fast lane for 'safe' profiles of datasets (or models), mandatory review for others
•Document clear mitigations for reviewers in case of deviations
•Decide about systems for tracking documentation, AI asset inventory, and product use
•Define a clear escalation process and decision considerations

Document
•Define minimum levels of information that must be tracked
•Use canonical designations for public datasets or AI models (see SPDX license lists for OSS licenses)
•Leverage industry standards (Model Cards, Dataset Cards, SPDX AI BOM spec: https://spdx.github.io/spdx-spec/v3.0.1/model/AI/AI/)
•Document centrally the decision rationale, impacted system, and relevant component (dataset or AI model)
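The balanced posture described above (fast lane for 'safe' profiles of datasets or models, mandatory review for everything else) can be sketched as a small routing rule. The profile table and field names below are illustrative assumptions, not an actual policy:

```python
# Hypothetical fast-lane table: (intended use, license) pairs that a policy
# has pre-approved. Everything else is routed to AI compliance review.
FAST_LANE = {
    ("internal", "Apache-2.0"),
    ("internal", "MIT"),
    ("commercial", "Apache-2.0"),
}

def route(intended_use, license_id, has_personal_data):
    """Return 'fast-lane' or 'review' for a dataset/model use request."""
    if has_personal_data:
        return "review"  # personal data always triggers mandatory review
    if (intended_use, license_id) in FAST_LANE:
        return "fast-lane"
    return "review"      # unknown licenses and uses default to review
```

The design point is the default: anything not explicitly pre-approved falls through to review, so the policy fails safe when a new license or use case appears.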
Approaches to govern the use of public models and data for AI
•Review every model or data use
•Do not review anything
•Create a 'fast lane' for uses of in-policy models or data
•Review the high-risk scenarios

Operational challenge: how to enable data scientists to quickly and reliably find in-policy models or data sources?

Key issues:
•Is the model/data publicly available?
•What are the terms?
•Is the model/data based on other models or data? (Lineage)
•Do these other models or data create risks or restrictions?
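The lineage questions above ("is the model/data based on other models or data, and do those create restrictions?") can be sketched as a walk over declared parents, gathering inherited obligations. For example, a Llama 3.1 fine-tune inherits the "Built with Llama" display requirement quoted earlier. Asset names and the graph shape are illustrative:

```python
# Hypothetical lineage graph: asset -> list of upstream models/datasets.
PARENTS = {
    "my-finetune": ["llama-3.1-70b"],
    "llama-3.1-70b": [],
}

# Hypothetical restriction registry keyed by asset.
RESTRICTIONS = {
    "llama-3.1-70b": ["Llama 3.1 Community License: display 'Built with Llama'"],
}

def inherited_restrictions(asset):
    """Walk the lineage graph transitively and gather upstream restrictions."""
    seen, stack, found = set(), [asset], []
    while stack:
        a = stack.pop()
        if a in seen:
            continue          # guard against cycles and shared ancestors
        seen.add(a)
        found += RESTRICTIONS.get(a, [])
        stack += PARENTS.get(a, [])
    return found
```

A transitive walk matters because restrictions do not stop at one hop: a variant of a variant of Llama still carries the naming and attribution obligations of the original license.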