Intrusion Tolerance for Networked Systems through Two-Level Feedback Control
KimHammar
9 views
47 slides
Jun 27, 2024
Slide 1 of 47
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
About This Presentation
We formulate intrusion tolerance for a system with service replicas as a two-level optimal control problem. On the local level node controllers perform intrusion recovery, and on the global level a system controller manages the replication factor. The local and global control problems can be formula...
We formulate intrusion tolerance for a system with service replicas as a two-level optimal control problem. On the local level node controllers perform intrusion recovery, and on the global level a system controller manages the replication factor. The local and global control problems can be formulated as classical problems in operations research, namely, the machine replacement problem and the inventory replenishment problem. Based on this formulation, we design TOLERANCE, a novel control architecture for intrusion-tolerant systems. We prove that the optimal control strategies on both levels have threshold
structure and design efficient algorithms for computing them. We implement and evaluate TOLERANCE in an emulation environment where we run 10 types of network intrusions. The results show that TOLERANCE can improve service availability and
reduce operational cost compared with state-of-the-art intrusion-tolerant systems.
Size: 6.76 MB
Language: en
Added: Jun 27, 2024
Slides: 47 pages
Slide Content
1/35
Intrusion Tolerance for Networked Systems
through Two-Level Feedback Control
IEEE DSN 2024, Brisbane, Australia
International Conference on Dependable Systems and Networks
Kim Hammar and Rolf Stadler [email protected]@kth.se
KTH Royal Institute of Technology
June 27, 2024
2/35
Use Case: Intrusion Tolerance
::: ClientsapigatewaysCompute nodesStorage nodes
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4 Client interface & load balancer
IAreplicated systemoers a service to a client population.
IThe system should provide.
2/35
Use Case: Intrusion Tolerance
::: AttackerClientsapigatewaysCompute nodesStorage nodes
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4 Client interface & load balancer
IAn
IThe system shouldtolerate intrusions.
3/35
Intrusion Tolerance (Simplied)
Intrusion eventTime of full recoveryTimeRecovery timeSurvivabilityLoss
Normal
performanceSystem performanceTolerance
Cumulative
performance loss
(want to minimize)
4/35
Increasing Demand for Intrusion-Tolerant Systems
IAs ourreliance on online services grows, there is an
increasing demand for intrusion-tolerant systems.
IExample applications:
Flight control
computer
Sensors and
actuators
Power grids
e.g.,scadasystems
1
.
Safety-critical IT systems
e.g., banking systems,
e-commerce applications
2
,
healthcare systems, etc.
Real-time control systems
e.g., ight control computer
3
.
1
Amy Babay et al. 2018 48th
Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 2018, pp. 255266.doi:
10.1109/DSN.2018.00036.
2
Jukka Soikkeli et al. IEEE
Transactions on Dependable and Secure Computing20.2 (2023), pp. 11541168.doi:
10.1109/TDSC.2022.3151462.
3
J.H. Wensley et al.
Proceedings of the IEEE66.10 (1978), pp. 12401255.doi:10.1109/PROC.1978.11114.
5/35
Theoretical Foundations of Intrusion Tolerance
L. LamportW. WeibullR.E. BarlowW.ShewhartL. AdlemanA. ShamirR. RivestD. DolevN. LynchE. DijkstraJ. GrayB. LiskovReliabilityTheoryDistributedSystemsCryptography
Intrusion-
tolerant
systems
6/35
Our Contribution
ReliabilityTheoryDistributedSystemsCryptographyControlanddecisiontheory
Intrusion-
tolerant
systemsThis paper
7/35
Building Blocks of An Intrusion-Tolerant System
HC;CrashedHealthyCompromisedCrashCrashRecoveryCompromise204060801000:20:40:60:8125 replicas50 replicas100 replicastReliability : : :Replicated systemClient interfaceRequestResponse Consensus protocol12345
1. Intrusion-tolerant consensus protocol
A quorum needs to reach agreement
to toleratefcompromised replicas.
2. Replication strategy
Cost-reliability trade-o.
3. Recovery strategy
Compromises will occur ast! 1.
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- No recoveries
Published 1995
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- No recoveries
Published 1998
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
-Periodicrecoveries
Published 2002
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2004
8/35
Prior Work on Intrusion-Tolerant Systems
-Adaptive replicationbased on heuristics
- Periodic recoveries
Published 2006
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2006
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
-Supports both periodic and reactive recoveries
- Does not provide reactive recovery strategies
Published 2007
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2011
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2018
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2023
8/35
Prior Work on Intrusion-Tolerant Systems
- Fixed number of replicas
- Periodic recoveries
Published 2023Can we do better by leveraging decision theory and optimal control?
9/35
ThetoleranceArchitecture
Two-levelrecoveryand replicationcontrol with feedback.
toleranceNode 1Privileged domain Application domain
Service
replicaNode controller
ids
alerts
reco-
veryVirtualization layerHardwareNode 2Privileged domain Application domain
Service
replicaNode controller
ids
alerts
reco-
veryVirtualization layerHardwareNodeNtPrivileged domain Application domain
Service
replicaNode controller
ids
alerts
reco-
veryVirtualization layerHardware: : :Consensus protocolSystem controllerState estimateEvict or addState estimateEvict or addState estimateEvict or add : : :Service requestsResponsesClients AttackerIntrusion attempts
10/35
Denition 1 (Correct service)
The system providescorrect serviceif the healthy replicas satisfy
the following properties:
Each request is eventually executed: (Liveness)
Each executed request was sent by a client. (Validity)
Each replica executes the same request sequence. (Safety)
11/35
Proposition 1 (Correctness oftolerance)
A system that implements thetolerancearchitectureprovides
correct serviceif
Network links are authenticated.
At most f nodes are compromised or crashed simultaneously.
Nt2f+1.
The system is partially synchronous.
12/35
Intrusion Tolerance as a Two-Level Control Problem
: : :1(b1)2(b2)3(b3)4(b4)Nt
(bNt
)
Belief
transmissionsNode controllers
Replicated
systemSystem controller(b1; : : : ;bNt
)b1b2b3b4bNt
: : :AttackerClients
IThe.
IThe.
12/35
Assumption 1
The probability that the system controller fails is negligible.
Assumption 2
Compromise and crash events are statistically independent across
nodes.
: : :1(b1)2(b2)3(b3)4(b4)Nt
(bNt
)
Belief
transmissionsNode controllers
Replicated
systemSystem controller(b1; : : : ;bNt
)b1b2b3b4bNt
: : :AttackerClients
13/35
The Local Control Problem
IPartially observed Markov decision processi.
IController actions: (R)ecover and (W)ait. ai;t2 fR;Wg.INode states: SN=f(H)ealthy;(C)ompromised;;g.si;t2 SN.
IState transition function: f(si;tjsi;t;ai;t).IpC;i: crash probability,pA;i: intrusion probability.IObservationoi;tzi(jsi;t): e.g.,idsalerts at timet.
HC;CrashedHealthyCompromisedpC;ipC;ia
(C)
i
=Ra
(A)
i
=A
15/35
Node Controller Objective
ICost:Ji,T
(R)
i
+F
(R)
i
.
IT
(R)
i
is the averagetime-to-recovery.
IF
(R)
i
is therecovery frequency.
I >1 is a scaling factor.
IBounded-time-to-recovery constraint: The time between two
recoveries can be at mostR.
1020304050607080901000:51p=0:1p=0:05p=0:025p=0:01p=0:005tFailure (crash or compromise) probability.
pis the failure probability per time-step.
16/35
Threshold Structure of the Optimal Control Strategy
0:10:20:30:40:50:60:70:80:910:40:60:81alpha vectorsE[Jijbi;1]bi;1
?
wait regionrecovery region
The controller's optimal cost function.
Theorem 2
There exists an optimal control strategy that satises
?
i;t(bi;t) =R()bi;t
?
i;t 8t;
where
?
i;t
2[0;1]is a threshold.
17/35
Ecient Computation of Optimal Recovery Strategies
Algorithm 1:Threshold Optimization
1Input:Objective functionJi, parametric optimizerpo.
2Output:An approximate optimal control strategy^i;.
3Algorithm
4 [0;1].
5 For each2, denei;(bi;t)as
6 i;(bi;t),
(
R ifbi;t
W otherwise.
7 J Ei;
[Ji].
8 ^i; po(;J).
9 return^i;.
IExamples of: cem,bo,
cma-es,de,spsa, etc.
18/35
Ecient Computation of Optimal Recovery Strategies
51015202510
0
10
1
10
2
10
3
cemdebospsadynamic programmingRTime (min)
Mean compute time to obtain an optimal recovery strategy for dierent
values of the bounded-time-to-recovery constraintR.
19/35
The Benet of Optimal Recovery Control24681012141618200:20:30:4Optimal strategyPeriodic strategyDKL(no intrusionkintrusion)Ji(4)Benet of optimal recovery
Optimal recovery control
tional cost intrusion detection model is
available.
Key insight
20/35
Intrusion Tolerance as a Two-Level Control Problem
: : :1(b1)2(b2)3(b3)4(b4)Nt
(bNt
)
Belief
transmissionsNode controllers
Replicated
systemSystem controller(b1; : : : ;bNt
)b1b2b3b4bNt
: : :AttackerClients
21/35
The Global Control Problem
IConstrained Markov decision process.
IStates: SS=f0;1; : : : ;smaxg, the number of healthy nodes.IController actions: Add a
(C)
t2 f0;1gnodes.
IDynamicsf:IMarkovstrategy:
:SS! f0;1g:
: : :1(b1)2(b2)3(b3)4(b4)Nt
(bNt
)
Belief
transmissionsNode controllers
Replicated
systemSystem controller(b1; : : : ;bNt
)b1b2b3b4bNt
: : :AttackerClients
22/35
System Controller Objective
ICost:J,limT!1
P
T
t=1
at
T
.
IConstraint: T
(A)
A, whereT
(A)
is the availability.
A Allowed service downtime per year
0:9 36 days
0:95 18 days
0:99 3 days
0:999 8 hours
0:9999 52 minutes
0:99999 5 minutes
1 0 minutes
23/35
System Reliability Analysis
ITheMean-time-to-failure(mttf) is the
of a state wherestf:
E[T
(F)
jS1=s1] =E
(St)t1
h
infft1jStfg jS1=s1
i
:
102030405060708090100100200300pi=0:05pi=0:025pi=0:01N1E[T
(F)
]
Themttfin function of the number of initial nodesN1and failure
probability per nodepi.
24/35
Theorem 3 (Optimal Control Strategy Existence)
Assuming
(A)The Markov chain induced by any control strategy is
unichain.
(B)The.
Then the following holds.
1.There.
2.The optimal strategy has a.
3.An optimal replication control strategy can be computed by
usinglinear programming.
25/35
Ecient Computation of Optimal Replication Control
Strategies
481632641282565121024204810
0
10
2
Maximum number of nodessmaxTime (s)
Mean compute time to obtain an optimal replication control strategy.
26/35
The Benet of Optimal Replication Control2004006008001;0000:51Optimal strategyN1= 10N1= 100tAvailabilityBenet of optimal replication control
Optimal replication control canguarantee a high service
availability in expectation. The
cation is mainly prominent for long-running systems.
Key insight
33/35
Estimated Distributions of Intrusion Alerts0 2000 4000 6000 8000
bz
i
(
o
i
j
a
(A) i
)
cve-2010-0426
0 2000 4000 6000 8000
cve-2015-3306
0 2000 4000 6000 8000
bz
i
(
o
i
j
a
(A) i
)
cve-2015-5602
0 2000 4000 6000 8000
cve-2016-10033
0 2000 4000 6000 8000
bz
i
(
o
i
j
a
(A) i
)
cwe-89
0 2000 4000 6000 8000
cve-2017-7494
0 2000 4000 6000 8000
oi2 O
bz
i
(
o
i
j
a
(A) i
)
cve-2014-6271
0 5000 10000 15000 20000
oi2 O
ftp,ssh,telnetbrute force
attack (a
(A)
i
=A) false alarms (a
(A)
i
=F)
IWe estimate the observation distributionzwith the
empirical distribution
b
Zbased onMsamples.
Ibz!
a.s
zasM! 1(Glivenko-Cantelli theorem).
34/35
Comparison with State-of-the-art Intrusion-Tolerant
Systems
00:5110
1
10
2
00:10:200:5110
1
10
2
00:10:200:511525110
1
10
2
toleranceno-recoveryperiodicperiodic-adaptive1525100:10:215251Maximum time-to-recoveryRMaximum time-to-recoveryRMaximum time-to-recoveryRAverage availabilityT
(A)
Average time-to-recoveryT
(R)
Average recovery frequencyF
(R)
N1=3N1=6N1=9
Comparison between our optimal control strategies and the baselines;
x-axes indicate values ofR; rows relate to the number of initial nodes
N1.
35/35
Conclusion
IWe present acontrol-theoretic model of intrusion
tolerance.
IWe establish.
IWe.
IOur control-theoretic strategies have
guarantees and signicantly better practical performance
the heuristic control strategies used in state-of-the-art
intrusion-tolerant systems.
: : :1(b1)2(b2)3(b3)4(b4)Nt
(bNt
)
Belief
transmissionsNode controllers
Replicated
systemSystem controller(b1; : : : ;bNt
)b1b2b3b4bNt
: : :AttackerClients