[CNKCD2025] Getting Metric & Trace Right: Making Data Collection and Storage More Efficient with Grafana Stack + Alloy

JeehyunMoon2 20 views 25 slides Sep 21, 2025

About This Presentation

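As your services and systems grow, have you ever felt that your existing Observability setup was hitting its limits? Prometheus, the best-known open-source solution, is an excellent and highly versatile tool, but once scalability and long-term data retention are taken into account, it always...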


Slide Content

Getting Metric & Trace Right
Making Data Collection and Storage More Efficient with Grafana Stack + Alloy
1

About me
??
•4?0 ??? ?
•Kubernetes & Cloud ??5? ?: KÐÄ3 ?(??, P¶Þ× L8 ?? S 
??.
•CNKCD 2024: ? ?5? EX? cm ?? - Cluster API ì App of Apps "
Career
•Kbank (2024.08 ~ now), SRE
•Smilegate Megaport (2022.01 ~ 2024.07), Platform Engineering & Cloud Tech
2

Index
•Background
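•Metric collection/storage in K8s environments
•How to manage metrics ⇢ from Prometheus to Thanos and Mimir
•Operating Mimir ⇢ more efficient metric collection with the Alloy Clustering feature
•Trace collection/storage in K8s environments
•Understanding Trace data
•Applying tail_sampling in a production environment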
•Cross Zone Traffic
•Lesson Learned
3

About Kbank
Kbank
•Launched on April 3, 2017 as Korea's first internet-only bank
•??7 ?3?, a7 ??? , ??8 $ç PP ¹I ???? S 
??.
•5? YouTube: ?I??D ??X (Q ?: ;I , ;?, MSA, AI/ML (AWS Summit Seoul 2024)
•Kubernetes?…
•AWS EKS ??5? ‘‡— 1? ? (> ?l 
??. PP ¹I P¶8 b_?? S 
??.
4

Background
•prometheus: 1 ??I? Prometheus3 Rollout á LI S8 z k_< , *
•opentelemetry-collector: DaemonSet5? ??? ? á ?? ??. (( ? n? B?)
•jaeger: in-memory DB3 ?? ?, production À²× 10 Ir traceE ‘% >? (+OOMKill W± ??)
Ô a?, Observability ??I ?? ? ?:.
y?x5? >? ,Ñ ì b? ?I ¯ð ??
5

Choosing the Grafana Stack
Storage backends that share the same component layout
(Loki, Tempo, Mimir)
A Collector that gathers every signal type

(formerly Grafana Agent)
? 
??8 >L??, Grafana ?? b?I œ7 j8 Šðî:.
Storage Ô É
Collector Ô É
6

" K8s ?D Metric ??/wb À à I  Ý
7

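How to manage metrics in a K8s environment
Options available when using Prometheus alone
prometheus-operator
•Manages Prometheus through the Kubernetes Operator pattern
•Metric scrape configuration through Operator-specific resources
•(CRDs such as ServiceMonitor and PodMonitor)
kube-prometheus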
•Service Discovery ??5? k_<8 W2 ?? (? uL À ??=)
•Helm 0_3 >ç >xK ???_( MI ?? >?
@žÞq, ?? k_< ??/ >?!
??a ?}> ????, Prometheus K " ?; ? ?I ???.
(;? ?? Ð>, ?R? HA ? ')
8

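How to manage metrics in a K8s environment
Long-term storage that compensates for Prometheus's limits: Thanos and Mimir
Mimir + Alloy
•Mimir is only a storage backend; it has no metric collection capability.
•A separate Collector (such as Alloy) scrapes the metrics and sends them to Mimir.
Thanos + Prometheus
•A Thanos sidecar is attached to Prometheus.
•Metric collection is still handled by Prometheus.
Both Thanos and Mimir support HA setups and long-term data retention.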
?? K8 ???? ?/? ? n? î0î.
9

Options for sending metrics to Mimir
•1?D ?, k_< ?? ?v >? ? ??> S:.
•2, 3?D ?, ?? D Collector> 2L? I 3 ??? ? wb? ? ?I F ??.
10

How the Mimir Ingester TSDB stores data
•By default, samples are expected to arrive in timestamp order. (In-Order)
•When a sample older than the newest one arrives, it is stored through a separate path. (Out-Of-Order)
⇢ More Out-Of-Order data means additional data growth inside Mimir.
•Reference (Prometheus TSDB, the head block): https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
•Reference (out-of-order sample ingestion in the Mimir Ingester): https://grafana.com/blog/2022/09/07/new-in-grafana-mimir-introducing-out-of-order-sample-ingestion/
11
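The In-Order versus Out-Of-Order distinction on the Mimir Ingester slide can be illustrated with a toy model. This is a minimal sketch, not Mimir's or Prometheus's actual TSDB code: `ToySeries`, `ingest`, and the 300-second window are hypothetical names and values chosen for illustration, and timestamps are plain integers in seconds.

```python
class ToySeries:
    """Very simplified model of how one series classifies incoming samples."""

    def __init__(self, ooo_window: int = 0):
        self.ooo_window = ooo_window  # seconds; 0 means out-of-order is disabled
        self.max_ts = -1              # newest timestamp accepted so far

    def append(self, ts: int) -> str:
        if ts > self.max_ts:
            # Newer than everything seen so far: normal in-order path.
            self.max_ts = ts
            return "in-order"
        if self.ooo_window > 0 and self.max_ts - ts <= self.ooo_window:
            # Older sample, but still inside the allowed window: separate path.
            return "out-of-order"
        return "rejected"


def ingest(arrivals, ooo_window):
    """Feed timestamps in arrival order and return how each one was classified."""
    series = ToySeries(ooo_window=ooo_window)
    return [series.append(ts) for ts in arrivals]


# One collector scraping every 30s: everything arrives in order.
single = ingest([0, 30, 60], ooo_window=300)

# Two unsynchronized collectors writing the same series: arrivals interleave.
mixed = ingest([0, 30, 10, 60, 40, 70], ooo_window=300)

# Same interleaving with out-of-order ingestion disabled.
strict = ingest([0, 30, 10], ooo_window=0)

print(single)
print(mixed)
print(strict)
```

In this sketch, the interleaved case is what produces out-of-order samples: nothing is wrong with either collector on its own, only with two writers sharing one series.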

Alloy supports a Clustering feature.
With Clustering, Collectors can distribute the scrape targets among themselves.
The following six components support the clustering option.
•prometheus.scrape
•prometheus.operator.servicemonitors
•prometheus.operator.podmonitors
•loki.source.kubernetes
•loki.source.podlogs
•pyroscope.scrape
–‘d ¿àîŠ œ7 j7 ‡é. (v>xK CPU/Memory ×})
Collector ( À O~, Ù¿I Sîv ¿ÞW.
12
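The idea of Collectors splitting scrape targets among themselves can be sketched with a toy consistent-hash ring. This is an illustration under stated assumptions, not Alloy's actual clustering implementation: `HashRing`, `assign`, the `alloy-N` instance names, and the `pod-N:9090` targets are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    def owner(self, target: str) -> str:
        # The owner is the first ring point clockwise from the target's hash.
        idx = bisect.bisect_right(self._keys, _hash(target)) % len(self._ring)
        return self._ring[idx][1]


def assign(targets, nodes):
    """Map every target to exactly one collector instance."""
    ring = HashRing(nodes)
    return {target: ring.owner(target) for target in targets}


targets = [f"pod-{i}:9090" for i in range(200)]
before = assign(targets, ["alloy-0", "alloy-1", "alloy-2"])
after = assign(targets, ["alloy-0", "alloy-1", "alloy-2", "alloy-3"])

# Targets whose owner changed after a fourth instance joined.
moved = [t for t in targets if before[t] != after[t]]
print(len(moved), "of", len(targets), "targets changed owner")
```

Every target has a single owner, so no series is written by two instances at once, and when an instance joins only the targets it takes over change hands.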

Comparing metric collection with and without Alloy Clustering
Without Alloy Clustering
Metric collection flow
With Alloy Clustering
Metric collection flow
With the Alloy Clustering feature, Out-Of-Order data inside Mimir can be reduced.
13

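Reducing Out-Of-Order data can lower Mimir's storage usage.
By metric collection method
PVC usage comparison for one Mimir Ingester (30Gi basis)
(the PVC serves as short-term storage)
By metric collection method
S3 Object Storage usage comparison (31-day basis)
(Object Storage serves as long-term storage)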
Dev ƒ K8s yS üýX k_< ûØ Ý³, MimirD PVC ì S3 ŽÝ }A ¡ 3Ó 0I3 Ó:.
14

" K8s ?D Trace ??/wb À à I  Ý
15

Understanding Trace data
Multiple Spans, each carrying Attributes and Events, come together to form a single Trace.
Trace?, `Ú8 ?? Span I \D Øä I?? I × MSA ƒ À² `d ,
16

??D ²: À² Trace I 3 vyç   àî
OpenTelemetry Auto-instrumentation is available. However…
` :?? ??D * ?: ??( ??? Trace * Trace
•With Auto-instrumentation, Trace data can be generated without modifying the application.
•?, L ?: ? `¿ À n? span ??I D( ? î0q a ? ? S:.
•w^D ??, ` ?? ? n? 2X?? üýX À² DB Connection Check ?? SpanI PP ¹I : ? #??.
17

Dropping unneeded Spans with Tail Sampling
To apply it, first identify an attribute unique to the Spans you want to drop.
In the Trace data, find the attribute unique to the spans to be sampled out → In the Collector config, add a policy that drops those spans through the tail_sampling component → Only Traces that pass sampling are stored in Tempo.
•Reference (Tail Sampling): https://grafana.com/docs/tempo/latest/configuration/grafana-alloy/tail-sampling/
18
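The attribute-based filtering on the Tail Sampling slide can be sketched as follows. This is a toy model, not the tail_sampling component itself: `tail_sample`, the span dictionaries, and the `db.statement` / `SELECT 1` rule are hypothetical examples. The point it shows is that the decision is made per trace, after its spans have been grouped.

```python
from collections import defaultdict


def tail_sample(spans, drop_attr, drop_value):
    """Group spans by trace ID, then keep only traces with no span matching the drop rule."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    kept = {}
    for trace_id, trace_spans in traces.items():
        matches = any(s["attributes"].get(drop_attr) == drop_value for s in trace_spans)
        if not matches:
            kept[trace_id] = trace_spans
    return kept


spans = [
    {"trace_id": "a", "name": "GET /orders", "attributes": {"http.method": "GET"}},
    {"trace_id": "a", "name": "SELECT orders", "attributes": {"db.system": "postgresql"}},
    # Hypothetical health-check span identified by a unique attribute value.
    {"trace_id": "b", "name": "connection check", "attributes": {"db.statement": "SELECT 1"}},
]

kept = tail_sample(spans, drop_attr="db.statement", drop_value="SELECT 1")
print(sorted(kept))
```

Picking an attribute that only the unwanted spans carry matters here: a rule that is too broad would drop whole traces that merely contain one matching span.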

Ô ƒ À² 100% sampling7 xäÞÑ ??
`IŽæI ?? ? "D Trace? xq ???? 0:. ??a…
# *??? ?aD 2L? Trace?E,
?%?v ?? ?2 TraceJ? a? ? P.
(?F7 <root span not yet received>)
•+x(probabilistic)  ÒB5ý L? Trace3 n%? Ù Bî5a, Grafana š À²× 2L Trace> ???? a? ? ? &=.
•I? Grafana ? I? v À² R F7 <root span not yet received>3 ,??.
$ D(? Trace ‘% ~?
Through the tail_sampling component, add a 10% probabilistic sampling policy
19
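A 10% probabilistic policy can be pictured as a deterministic decision derived from the trace ID, so that every span of a trace gets the same verdict. This is a minimal sketch under that assumption, not the component's real algorithm; `keep_trace` and the trace ID format are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, percentage: float) -> bool:
    """Keep roughly `percentage` percent of traces, decided only by the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percentage * 100


trace_ids = [f"trace-{i}" for i in range(20_000)]
kept = [t for t in trace_ids if keep_trace(t, 10)]
print(f"kept {len(kept) / len(trace_ids):.1%} of traces")
```

Because the verdict depends only on the trace ID, asking again for the same trace always returns the same answer, which avoids keeping some spans of a trace while dropping others.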

Tail Sampling prerequisite: all spans must arrive at the same collector
K8s aD ? ƒ À²× ?? ? ?X?? / ( Collector> ???.
Grafana Tempo Docs

(same as the reference link on p18)
•OpenTelemetry load-balancing exporter
•Reference: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter
We arranged the Collectors in two tiers.
(load-balancing exporter configured on the collector that receives the data first)
20
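Why a trace-aware first tier matters can be shown by comparing two routing strategies. This is a simplified sketch: the real load-balancing exporter uses its own hashing scheme, while this example uses a plain hash modulo, and `route_by_trace_id`, `route_round_robin`, and the `sampler-N` names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import cycle


def route_by_trace_id(spans, collectors):
    """Send each span to a collector chosen from its trace ID."""
    placement = defaultdict(set)
    for span in spans:
        digest = hashlib.sha256(span["trace_id"].encode()).digest()
        collector = collectors[int.from_bytes(digest[:8], "big") % len(collectors)]
        placement[span["trace_id"]].add(collector)
    return placement


def route_round_robin(spans, collectors):
    """Send spans to collectors in turn, ignoring which trace they belong to."""
    placement = defaultdict(set)
    for span, collector in zip(spans, cycle(collectors)):
        placement[span["trace_id"]].add(collector)
    return placement


collectors = ["sampler-0", "sampler-1"]
# Three traces with four spans each, arriving interleaved.
spans = [{"trace_id": f"t{t}", "span": i} for i in range(4) for t in range(3)]

by_trace = route_by_trace_id(spans, collectors)
round_robin = route_round_robin(spans, collectors)

print({t: sorted(c) for t, c in by_trace.items()})
print({t: sorted(c) for t, c in round_robin.items()})
```

With trace-ID routing each trace lands on exactly one second-tier collector, so that collector sees the whole trace before deciding; with round-robin the spans of one trace are scattered and no single collector can make a complete decision.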

% Cross Zone Traffic
21

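Inside the storage backends, Cross Zone Traffic occurs between components
Inter-zone traffic incurs transfer cost, so Public Cloud users need to take it into account.
Because of the hash ring (Consistent hashing),
components actually talk to each other directly by Pod IP.
(not through Service DNS)
Distributor → Ingester
is where Cross Zone traffic is heaviest.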
(> Mimir> I??J? a ?)
%
22

Inside the storage backends, Cross Zone Traffic occurs between components
Enabling gRPC compression let us reduce traffic cost.
(:? r
•Mimir Production Guide: https://grafana.com/docs/mimir/latest/manage/run-production-environment/production-tips/#network
•Mimir(2.16.0) gRPC S2 Compression: https://github.com/grafana/mimir/pull/9322
23
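The effect of compression on metric traffic can be estimated with a rough experiment. This sketch uses zlib from the Python standard library as a stand-in, not the S2 codec referenced above, and the payload is a made-up, label-heavy text sample rather than real remote-write data; `compression_ratio` is a hypothetical helper.

```python
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload)) / len(payload)


# Metric series repeat the same label names and values many times over.
series = [
    f'http_requests_total{{namespace="shop",pod="api-{i}",zone="ap-northeast-2a"}} {i}\n'
    for i in range(500)
]
payload = "".join(series).encode()

print(f"compressed size is {compression_ratio(payload):.0%} of the original")
```

Repeated label sets are what make this kind of payload compress well, which is why compression between components can cut billed cross-zone bytes at the price of some CPU.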

! Lesson Learned
" Grafana Stack ?: @ ?2?7 ? ?i. ??a, ÒE ?$> ??.
•I ?: \7 KubernetesE8 Šðà ?: > ??. ÒE ?2 ? '? ??(( ? ??q a ? ? S?.
•?, Multi Tenancy ??7 >?> ? ?: (wb ?@ ݳ, ?: (kK ݳ a , ??q ; >?)
" ?: D `¿ ? ?? ??3 , ,d u ????, ObservabilityD ?;m( MI ? ???.
•LD ~? ? ySD ~?> 2L???? 100% b?? ? ??. (?? `IŽæ7 ??? C?)
•?? K8s Observability ™| À², Trace? Metric / Log À :? ‡Ò7 ÓQ b?I – S? ž.
• ?W> ?? E, ObservabilityD ?;m> ? ???.
?? ?$> ?8??, àã À }Z j8 ðv ?8 O8 ?( S:.
E ? ?8 O7 j M?v, ?x? ^ & ?$3 ?? X7 ?$ }??W &
24

End Of Document
25