AX_2023_ASA (1)machine learning-based breeding .pdf

yejian 11 views 38 slides Jul 27, 2024


[email protected]
Quantitative Geneticist, Breeding Analyst LAAF
Machine learning-based breeding
Alencar Xavier
Breeding Analyst at Corteva
Adjunct professor at Purdue

Outline
1.Introduction
•More data
•Branching ML
2.Machines
•Filters
•Engines
3.Analytics
•Target G x E x M
•Validation
•Case studies
4.Conclusion


More Pheno, More Geno, More Env, More Computing
•UC Merced GridMET
•NWS NOAA
•NASA GISS, NASA POWER
•Harmonized SoilDB
•USDA SSURGO
Sources:
Stephens, Z. D. et al. (2015). Big data: astronomical or genomical? PLoS Biology, 13(7), e1002195.
The Cost of Sequencing a Human Genome. NIH. https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
https://www.mdpi.com/2076-3417/12/5/2570
ML in plant breeding
•Breeding Processes
  •Environment: processing; mapping
  •Phenotyping: computer vision; image analysis
  •Genomics: autoscoring; genome assembly
  •Gene Editing: target identification
•Breeding Analytics
  •Prediction & Classification: breeding decisions
  •Clustering & Associations: germplasm and environment analysis

ML in breeding processes
https://www.nature.com/articles/s41598-022-06336-y
Embryo rescue
DH production
https://www.mdpi.com/2673-2688/2/3/26
https://www.biomedcentral.com/collections/phenomics
Disease, stress scoring
https://www.mdpi.com/2072-4292/8/12/1031
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7706325/
Phenotype automation
(e.g., plant height, identify new traits)
https://doi.org/10.1093/bioinformatics/btab268
Gene editing targets
https://doi.org/10.1093/bioinformatics/btaa971
Latent weather, soil
Enhancing databases, automating lab tasks and field work
https://doi.org/10.1186/1753-6561-3-s7-s58
https://www.nature.com/articles/s41467-022-29843-y
SNP calls, genome assembly
https://www.publish.csiro.au/cp/CP14007
Mapping / zoning



Machine Learning Engines

Machine Learning, Linear Algebra, Statistics

Key idea of supervised learning: FILTERING
Simple filter: $y = g + e$
Multiple filters: $y = g + h + e$
Multi-task filter: [equation not recoverable]

Why bother with multiple filters?
Figure panels: Field variation; Family layout (SoyNAM field, Indiana 2014)
Some families were placed on the unfavorable side of the field…

Separation of tangled signals!
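A minimal sketch of why a second filter helps, assuming a made-up, noise-free layout (two families, one favorable and one unfavorable field side). All numbers and names here are hypothetical and are not SoyNAM data. Fitting family and field side jointly recovers the genetic difference that the raw family means overstate.

```python
# Hypothetical layout: family A sits mostly on the bad side of the field.
# Assumed true values: base 50, family B = +2, bad side = -5, no noise.
plots = (
    [("A", "good", 50.0)] + [("A", "bad", 45.0)] * 3
    + [("B", "good", 52.0)] * 3 + [("B", "bad", 47.0)]
)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def family_mean(fam):
    ys = [y for f, _, y in plots if f == fam]
    return sum(ys) / len(ys)

# Single filter: raw family means, field effect tangled with genetics.
naive_diff = family_mean("B") - family_mean("A")

# Two filters: y = intercept + family + field side, via X'X b = X'y.
X = [[1.0, 1.0 if f == "B" else 0.0, 1.0 if s == "bad" else 0.0]
     for f, s, _ in plots]
y = [v for _, _, v in plots]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(3)]
intercept, family_b, bad_side = solve(XtX, Xty)

print("naive family difference:", naive_diff)
print("joint fit: family =", family_b, "field side =", bad_side)
```

In this toy layout the raw difference between families is 4.5, while the joint fit returns the assumed genetic difference of 2 and the field-side effect of -5.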

Why bother with multi-task filters?
Simple (bivariate) model: $y = g + e$
$$\mathrm{Var}\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \sigma^2_{a_1} & \sigma_{a_{12}} \\ \sigma_{a_{12}} & \sigma^2_{a_2} \end{bmatrix} + \begin{bmatrix} \sigma^2_{e_1} & \sigma_{e_{12}} \\ \sigma_{e_{12}} & \sigma^2_{e_2} \end{bmatrix}$$
INFORMATION GAIN

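Why bother with multi-task filters?
$y = Zg + e, \quad y \sim N(0, V)$
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} Z_1 & 0 \\ 0 & Z_2 \end{bmatrix} \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$$
•Covariance structure
$$V = G \otimes \Sigma_a + I \otimes \Sigma_e = G \otimes \begin{bmatrix} \sigma^2_{a_1} & \sigma_{a_{12}} \\ \sigma_{a_{12}} & \sigma^2_{a_2} \end{bmatrix} + I \otimes \begin{bmatrix} \sigma^2_{e_1} & \sigma_{e_{12}} \\ \sigma_{e_{12}} & \sigma^2_{e_2} \end{bmatrix}$$
•Model equation (superscripts index elements of the inverse matrices, e.g. $\Sigma_e^{12}$ is element (1,2) of $\Sigma_e^{-1}$)
$$\begin{bmatrix} Z_1'\Sigma_e^{11}Z_1 + G^{-1}\Sigma_a^{11} & Z_1'\Sigma_e^{12}Z_2 + G^{-1}\Sigma_a^{12} \\ Z_2'\Sigma_e^{12}Z_1 + G^{-1}\Sigma_a^{12} & Z_2'\Sigma_e^{22}Z_2 + G^{-1}\Sigma_a^{22} \end{bmatrix} \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} Z_1'(\Sigma_e^{11}y_1 + \Sigma_e^{12}y_2) \\ Z_2'(\Sigma_e^{22}y_2 + \Sigma_e^{12}y_1) \end{bmatrix}$$
•Univariate vs bivariate
$$g_1 = (Z_1'\Sigma_e^{11}Z_1 + G^{-1}\Sigma_a^{11})^{-1} Z_1'\Sigma_e^{11}y_1$$
$$g_1 \mid g_2 = (Z_1'\Sigma_e^{11}Z_1 + G^{-1}\Sigma_a^{11})^{-1} \left[ Z_1'(\Sigma_e^{11}y_1 + \Sigma_e^{12}y_2) - (Z_1'\Sigma_e^{12}Z_2 + G^{-1}\Sigma_a^{12})\, g_2 \right]$$
INFORMATION GAIN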
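The information gain can be illustrated with a small numeric sketch. It assumes the simplest possible case, which is not taken from the slides: one unrelated individual (G = I) with a single record per trait, and made-up variance components. The function names are hypothetical. Reliability (squared accuracy) for trait 1 is compared when using only $y_1$ versus using both $y_1$ and $y_2$.

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def reliability_univariate(sigma_a, sigma_e):
    """var(g1_hat) / var(g1) when only y1 is used."""
    return sigma_a[0][0] / (sigma_a[0][0] + sigma_e[0][0])

def reliability_bivariate(sigma_a, sigma_e):
    """c' V^{-1} c / var(g1), with c = cov(g1, [y1, y2]) and V = Sigma_a + Sigma_e."""
    V = [[sigma_a[i][j] + sigma_e[i][j] for j in range(2)] for i in range(2)]
    Vi = inv2(V)
    c = sigma_a[0]
    explained = sum(c[i] * Vi[i][j] * c[j] for i in range(2) for j in range(2))
    return explained / sigma_a[0][0]

# Made-up variance components: genetic covariance 0.8, independent residuals.
sigma_a = [[1.0, 0.8], [0.8, 1.0]]
sigma_e = [[1.0, 0.0], [0.0, 1.0]]

uni = reliability_univariate(sigma_a, sigma_e)
biv = reliability_bivariate(sigma_a, sigma_e)

# With zero covariances the second trait adds nothing.
no_cov = [[1.0, 0.0], [0.0, 1.0]]
biv_independent = reliability_bivariate(no_cov, sigma_e)

print("univariate:", uni, "bivariate:", biv, "bivariate, no covariance:", biv_independent)
```

Under these assumed values the second trait only helps through the off-diagonal covariances; when they are zero, the bivariate and univariate reliabilities coincide.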

Does the choice of filter matter?
•ADDITIVE LINEAR FILTERS (GEBV)
  •Pattern: ADDITIVE GENETICS (heritable)
  •Method: GBLUP, RIDGE, LASSO
  •Suits: RECYCLING, ADVANCEMENT
•NON-LINEAR FILTERS (EGV)
  •Pattern: ANY GENETIC SIGNAL
  •Method: RKHS, DNN, Random Forest
  •Suits: ADVANCEMENT, PRODUCT PLACEMENT
Breeding pipeline (diagram): Advancement, Recycling, Incorporation

Main classes of learners
$y = Xb + e$
•Linear models
•Kernel methods
•Neural networks
•Ensembled trees

Solving: $y = Xb + e$
Objective: $\arg\min_b \,(e'e + \lambda b'b)$
•Coordinate descent (uses diagonals of LHS)
$$\hat b_j^{t+1} = \frac{x_j'(y - X_{-j}\hat b_{-j})}{x_j'x_j + \lambda}$$
  •Used for p>>n solvers
  •Software: glmnet, BGLR, bWGR, GS3
•Gradient descent (does not build LHS)
$$\hat b^{t+1} = \hat b^{t} + \frac{2r}{n}\left[ X'(y - X\hat b^{t}) - \lambda \hat b^{t} \right]$$
  •Used for Deep Neural Nets
  •Software: TensorFlow Keras, PyTorch, MXNet
•Second order (builds entire LHS)
$$\hat b = (X'X + \lambda I)^{-1}(X'y)$$
  •Used for everything else
  •Software: ASREML, lme4, SAS
I’ve created a monster!!

Coordinate descent: $\hat b_j^{t+1} = \dfrac{x_j'(y - X_{-j}\hat b_{-j})}{x_j'x_j + \lambda}$
Gradient descent: $\hat b^{t+1} = \hat b^{t} + \dfrac{2r}{n}\left[ X'(y - X\hat b^{t}) - \lambda \hat b^{t} \right]$
What about deep learning?
$y = \alpha(\alpha(XB_1)B_2)\,b_3 + e$
i.e., a stack of solvers
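The coordinate-descent and gradient-descent updates for the ridge problem can be sketched in plain Python. This is an illustration under assumptions: the toy data, the step size `rate` (standing in for r), the iteration counts and all function names are hypothetical. Both solvers are checked against the normal equations $(X'X + \lambda I)b = X'y$, which the second-order approach solves directly.

```python
def column(X, j):
    return [row[j] for row in X]

def predict(X, b):
    return [sum(x * bj for x, bj in zip(row, b)) for row in X]

def coordinate_descent(X, y, lam, sweeps=200):
    """Update one coefficient at a time; only x_j'x_j (LHS diagonal) is needed."""
    p = len(X[0])
    b = [0.0] * p
    xtx = [sum(v * v for v in column(X, j)) for j in range(p)]
    e = list(y)  # residuals y - Xb, kept up to date
    for _ in range(sweeps):
        for j in range(p):
            xj = column(X, j)
            # x_j'(y - X_{-j} b_{-j}) = x_j'e + x_j'x_j * b_j
            rhs = sum(x * r for x, r in zip(xj, e)) + xtx[j] * b[j]
            new_bj = rhs / (xtx[j] + lam)
            delta = new_bj - b[j]
            e = [r - x * delta for r, x in zip(e, xj)]
            b[j] = new_bj
    return b

def gradient_descent(X, y, lam, rate=0.05, steps=2000):
    """Full-gradient steps on e'e + lam * b'b; X'X is never formed."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(steps):
        e = [yi - fi for yi, fi in zip(y, predict(X, b))]
        xte = [sum(x * r for x, r in zip(column(X, j), e)) for j in range(p)]
        b = [bj + (2 * rate / n) * (g - lam * bj) for bj, g in zip(b, xte)]
    return b

def normal_equation_gap(X, y, b, lam):
    """Largest |(X'X + lam I) b - X'y| entry; zero at the ridge solution."""
    fitted = predict(X, b)
    gaps = []
    for j in range(len(X[0])):
        xj = column(X, j)
        lhs = sum(x * f for x, f in zip(xj, fitted)) + lam * b[j]
        rhs = sum(x * yi for x, yi in zip(xj, y))
        gaps.append(abs(lhs - rhs))
    return max(gaps)

# Toy data (hypothetical): 5 observations, 3 predictors.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0],
     [1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
y = [2.0, 1.0, 3.0, 1.0, 0.5]
lam = 1.0

b_cd = coordinate_descent(X, y, lam)
b_gd = gradient_descent(X, y, lam)
print("coordinate descent:", b_cd)
print("gradient descent:  ", b_gd)
print("gaps:", normal_equation_gap(X, y, b_cd, lam), normal_equation_gap(X, y, b_gd, lam))
```

Both routines reach the same coefficients on this toy problem. They differ in how much of the left-hand side they touch per update, which is the distinction the slide draws between the solver families.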

Data > Method
Unnecessarily complex analysis should not be used as a foil to disguise lower quality datasets
Kruuk (2004 apud Walsh and Lynch 2018)


Analytics

“Breeding objective”
•Set of traits of interest (TOI) bred into a
•Target population of genotypes (TPG) for a given
•Target population of environments (TPE)

TPE, TPG, TPM
•Target population of environments (TPE)
  •Influences accuracies via GxE correlation
  •Which environments should I be able to predict?
•Target population of genotypes (TPG)
  •Influences accuracies via genetic relationship
  •Which genetics should I be able to predict?
•Target population of management (TPM)
  •Herein nested in TPE

E[MET]=E[TPE]?
•Any given trial happens in one environment-management combination, which is a sample of a much larger population: $e_i \in E$
That is, the TPE is the population $E$ from which the trial environments $e_i$ are sampled:
$$\mathrm{Var}\begin{bmatrix} y_{e_i} \\ y_{e_j} \\ g_E \end{bmatrix} = \begin{bmatrix} \sigma^2_{g e_i} + \sigma^2_{\epsilon e_i} & \sigma_{g e_i, e_j} & \sigma_{g e_i, E} \\ \sigma_{g e_j, e_i} & \sigma^2_{g e_j} + \sigma^2_{\epsilon e_j} & \sigma_{g e_j, E} \\ \sigma_{g E, e_i} & \sigma_{g E, e_j} & \sigma^2_{g E} \end{bmatrix}$$

NOTE: GxExM patterns within the TPE are largely assessed using different ML methods

TPG + TPE
•Accuracy (Wientjes et al. 2016) = correlation(true signal, estimated signal)
•It is a function of heritability, GxE, and representativeness of the calibration set
•For: $y = g + e$, $\mathrm{var}(y) = V$, $\mathrm{var}(g) = G$
Then accuracy is
$$a_i = \mathrm{cor}(g_i, \hat g_i) = \frac{\mathrm{cov}(g_i, \hat g_i)}{\sqrt{\mathrm{var}(g_i)\,\mathrm{var}(\hat g_i)}} = \frac{\mathrm{var}(\hat g_i)\, r^2_{GxE}}{\sqrt{\mathrm{var}(g_i)\,\mathrm{var}(\hat g_i)}} = r^2_{GxE} \sqrt{\frac{G_{i,y} V^{-1} G_{y,i}}{G_{i,i}}}$$
Thus, we know how much signal to expect in any given prediction
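A sketch of how the relationship part of the accuracy expression, $\sqrt{G_{i,y} V^{-1} G_{y,i} / G_{i,i}}$, could be evaluated for a target individual. The relationship values, variance components and function names are hypothetical. The slide's GxE term is passed in as a plain multiplier `gxe_factor`, with 1.0 standing for a target environment that matches the calibration data.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def expected_accuracy(K_train, k_target, k_tt, var_g, var_e, gxe_factor=1.0):
    """K_train: relationships among calibration individuals (one record each);
    k_target: relationships between the target and the calibration individuals;
    k_tt: the target's own diagonal relationship."""
    n = len(K_train)
    V = [[var_g * K_train[i][j] + (var_e if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    g_iy = [var_g * k for k in k_target]
    w = solve(V, g_iy)                              # V^{-1} G_{y,i}
    var_ghat = sum(a * b for a, b in zip(g_iy, w))  # G_{i,y} V^{-1} G_{y,i}
    return gxe_factor * math.sqrt(var_ghat / (var_g * k_tt))

# Hypothetical calibration set: two related individuals and one unrelated.
K = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

acc_related = expected_accuracy(K, [0.5, 0.5, 0.0], 1.0, var_g=1.0, var_e=1.0)
acc_unrelated = expected_accuracy(K, [0.0, 0.0, 0.0], 1.0, var_g=1.0, var_e=1.0)
# Target is itself the only calibration record: accuracy equals sqrt(h2).
acc_self = expected_accuracy([[1.0]], [1.0], 1.0, var_g=1.0, var_e=1.0)

print(acc_related, acc_unrelated, acc_self)
```

The target with no relationship to the calibration set gets zero expected accuracy, which is the TPG side of the slide's message: relatedness determines how much signal a prediction can carry.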

Validation schemes
1) CV type – Test intent
•Random CV = Upper-bound predictive potential
•Leave-one-out = Assess structured scenarios (e.g., geography-out, year-out)
•Holdout = Reproduce true applications (e.g., predict individuals from upcoming)
2) TPE/TPG relation
Adapted from Crossa et al. (2017) doi.org/10.1016/j.tplants.2017.08.011
       Genotype   Environment   Difficulty
CV00   New        New           *****
CV0    Observed   New           ***
CV1    New        Observed      ***
CV2    Observed   Observed      *
3) Signal availability
Genetic information available in different cross-validation setups
•Intra-family: Linkage*
•Within-family: Linkage and LD
•Across-family: Relationships**, Linkage and LD
•Leave-family-out: Relationships and LD
•Untested environments: Same as above x ( GxE )
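The CV00/CV0/CV1/CV2 table can be read as a simple rule on whether a test record's genotype and environment appear in the training data. A minimal sketch with hypothetical record names:

```python
def cv_label(genotype, environment, train):
    """Classify a test record by what the training set has already observed."""
    seen_genotypes = {g for g, _ in train}
    seen_environments = {e for _, e in train}
    genotype_observed = genotype in seen_genotypes
    environment_observed = environment in seen_environments
    if genotype_observed and environment_observed:
        return "CV2"
    if environment_observed:
        return "CV1"   # new genotype, observed environment
    if genotype_observed:
        return "CV0"   # observed genotype, new environment
    return "CV00"      # new genotype, new environment

# Hypothetical training records as (genotype, environment) pairs.
train = [("G1", "E1"), ("G1", "E2"), ("G2", "E1")]

labels = {
    ("G2", "E2"): cv_label("G2", "E2", train),
    ("G3", "E1"): cv_label("G3", "E1", train),
    ("G1", "E3"): cv_label("G1", "E3", train),
    ("G3", "E3"): cv_label("G3", "E3", train),
}
for record, label in labels.items():
    print(record, label)
```

Labelling test records this way makes it explicit which scenario a reported accuracy refers to before comparing it with another study.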

Validation metrics
•Correlations
  •Most common metrics in breeding (e.g., predictability)
  •Pertinent to ranking and selection of complex traits
•Prediction error
  •Utilized when the predicted values must be as close as possible to the original scale
  •Pertinent to risk prediction (e.g., disease risk)
•Success
  •Accommodates complex or subjective criteria, independent or otherwise
  •Pertinent to decisions involving data from multiple sources (e.g., advancement)
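The three metric families can give different verdicts on the same predictions. The sketch below uses made-up values in which predictions are exactly twice the observations. The `success_rate` definition (overlap between the observed and predicted top-k) is only one hypothetical way to encode a "success" criterion.

```python
def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def rmse(observed, predicted):
    n = len(observed)
    return (sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5

def top_k(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def success_rate(observed, predicted, k):
    """Share of the truly best k entries that the prediction would also select."""
    return len(top_k(observed, k) & top_k(predicted, k)) / k

observed = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [2.0 * v for v in observed]   # right ranking, wrong scale

cor = pearson(observed, predicted)
err = rmse(observed, predicted)
success = success_rate(observed, predicted, k=2)
print("correlation:", cor, "RMSE:", err, "top-2 success:", success)
```

Here correlation and top-2 success are perfect while the RMSE is large. Ranking metrics suit selection decisions, and error metrics suit cases where the scale of the prediction matters.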

Amount of signal that can be captured in different structures
SoyNAM data
•ES: 2012 (7 loc)
•PS: 2013 (4 loc)
•#Fam = 40
•Genos = 5600
•SNPs = 4300
•Obs: 3k-5k obs/loc
Figure axes (two panels): CV scheme, $r^2_{GxE}$

Case study

Evaluation criterion

2022 G2F GxE prediction competition
TPG, TPE

What was modeled?
$y \mid E_i = \mu_i + g \mid E_i$
Phenotype @ i-th Loc = i-th Loc Mean + Genetic effect @ i-th Loc
•The winning approach (two FILTERS):
  •Predict location means using mixed models and random forest
  •Predict genetics using TPE/TPG index, multi-response filter (mt-GBLUP)
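The decomposition into a location mean plus a genetic effect also explains why within-location and across-location correlations can disagree. The sketch below uses invented numbers, not competition data: location means are predicted perfectly, while the genetic deviations are predicted in the wrong order.

```python
def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical observed phenotypes for three hybrids at two locations.
observed = {"LocA": [98.0, 100.0, 102.0], "LocB": [198.0, 200.0, 202.0]}

# Filter 1: predicted location means (assumed perfect here).
loc_mean_hat = {"LocA": 100.0, "LocB": 200.0}
# Filter 2: predicted genetic effects within location (deliberately reversed).
genetic_hat = {"LocA": [1.0, 0.0, -1.0], "LocB": [1.0, 0.0, -1.0]}

# Phenotype at location i = location mean + genetic effect at location i.
predicted = {loc: [loc_mean_hat[loc] + g for g in genetic_hat[loc]]
             for loc in observed}

within = [pearson(observed[loc], predicted[loc]) for loc in observed]
cor_within = sum(within) / len(within)

all_observed = [v for loc in observed for v in observed[loc]]
all_predicted = [v for loc in observed for v in predicted[loc]]
cor_across = pearson(all_observed, all_predicted)

print("within-location correlation:", cor_within)
print("across-location correlation:", cor_across)
```

The across-location correlation stays close to 1 because it is driven by the location means, whereas the within-location correlation exposes the poor genetic ranking. This is one reason the ranking of entries can change with the metric used.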

2022 G2F GxE prediction competition
Realized results; ranking with alternative metrics
Team Name            Cor Within Loc
CLAC                 0.357
CGM                  0.353
MPB_Group            0.342
UCD_MegaLMM          0.338
SmAL                 0.285
DeepCropVision       0.281
CropEnthusiast       0.279
AllModelsAreWrong    0.272
DataJanitors         0.256
supermanwasd         0.243
Team Name            Cor Across Loc
breedingteam         0.650
DataJanitors         0.644
CLAC                 0.631
Purdue               0.631
UCD_MegaLMM          0.628
phenomaize           0.617
igorkf               0.600
CGM                  0.587
SmAL                 0.586
AllModelsAreWrong    0.575
Team Name            Within RMSE
CLAC                 2.329
igorkf               2.345
phenomaize           2.374
UCD_MegaLMM          2.387
CGM                  2.391
breedingteam         2.398
Purdue               2.402
SmAL                 2.425
ML_APT               2.472
MPB_Group            2.544
Source: Jacob Washburn, Jose Ignacio Varela, Alencar Xavier


doi/10.5555/2969442.2969519

Thank you for your attention!
Final remarks:
1)Plant breeding uses machine learning for multiple purposes in processes and analytics
2)Filter settings are important to maximize signal, but they are less important than data
3)Validation metrics and validation schemes matter to design meaningful models
Alencar Xavier
[email protected]
Questions??