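In Kubernetes, kube-scheduler is responsible for scheduling Pods. Its main flow is: fetch a pod that has not yet been scheduled, select the nodes that satisfy the pod's requirements, score those nodes, and take the highest-scoring node as the scheduling result. kube-scheduler therefore has two kinds of algorithms: those that filter nodes, called predicates, and those that score nodes, called priorities. This analysis describes how kube-scheduler manages these algorithms.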
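The filter-then-score flow described above can be sketched as follows. This is a minimal illustration, not the scheduler's real code: the Node and Pod types, the fitsCPU predicate, and the mostFreeCPU priority are all hypothetical stand-ins invented for the example.

```go
package main

import "fmt"

// Node and Pod are hypothetical, simplified stand-ins for the real API types.
type Node struct {
	Name    string
	FreeCPU int // free CPU in millicores
}

type Pod struct {
	Name   string
	CPUReq int // requested CPU in millicores
}

// Predicate reports whether a pod fits on a node.
type Predicate func(pod Pod, node Node) bool

// Priority scores a node for a pod; higher is better.
type Priority func(pod Pod, node Node) int

// schedule keeps only the nodes that pass every predicate, sums the priority
// scores of each remaining node, and returns the name of the best one.
// The boolean is false when no node passes the predicates.
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if !fits {
			continue
		}
		score := 0
		for _, pr := range prios {
			score += pr(pod, n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n.Name, score, true
		}
	}
	return best, found
}

// fitsCPU is a hypothetical predicate: the node must have enough free CPU.
func fitsCPU(pod Pod, node Node) bool { return node.FreeCPU >= pod.CPUReq }

// mostFreeCPU is a hypothetical priority: prefer the node with the most CPU left over.
func mostFreeCPU(pod Pod, node Node) int { return node.FreeCPU - pod.CPUReq }

func main() {
	nodes := []Node{{"node-a", 500}, {"node-b", 4000}, {"node-c", 2000}}
	pod := Pod{Name: "web", CPUReq: 1000}
	name, ok := schedule(pod, nodes, []Predicate{fitsCPU}, []Priority{mostFreeCPU})
	fmt.Println(name, ok)
}
```

Here node-a is filtered out by the predicate, and node-b wins the scoring step over node-c.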

algorithmprovider

The kube-scheduler entry point, /plugin/cmd/kube-scheduler/app/server.go, contains the following import:

_ "k8s.io/kubernetes/plugin/pkg/scheduler/algorithmprovider"

When a package is imported, Go automatically runs its init() function. So let's look at the init() function for algorithmprovider, defined in /plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go:

func init() {
	// Register functions that extract metadata used by predicates and priorities computations.
	factory.RegisterPredicateMetadataProducerFactory(
		func(args factory.PluginFactoryArgs) algorithm.MetadataProducer {
			return predicates.NewPredicateMetadataFactory(args.PodLister)
		})
	factory.RegisterPriorityMetadataProducerFactory(
		func(args factory.PluginFactoryArgs) algorithm.MetadataProducer {
			return priorities.PriorityMetadata
		})
	// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
	// by specifying flag.
	//***Fankang***//
	//***Register "DefaultProvider"***//
	factory.RegisterAlgorithmProvider(factory.DefaultProvider, defaultPredicates(), defaultPriorities())
	// Cluster autoscaler friendly scheduling algorithm.
	factory.RegisterAlgorithmProvider(ClusterAutoscalerProvider, defaultPredicates(),
		copyAndReplace(defaultPriorities(), "LeastRequestedPriority", "MostRequestedPriority"))
	// Registers predicates and priorities that are not enabled by default, but user can pick when creating his
	// own set of priorities/predicates.
	// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
	// For backwards compatibility with 1.0, PodFitsPorts is registered as well.
	factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
	// Fit is defined based on the absence of port conflicts.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate("PodFitsHostPorts", predicates.PodFitsHostPorts)
	// Fit is determined by resource availability.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate("PodFitsResources", predicates.PodFitsResources)
	// Fit is determined by the presence of the Host parameter and a string match
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate("HostName", predicates.PodFitsHost)
	// Fit is determined by node selector query.
	factory.RegisterFitPredicate("MatchNodeSelector", predicates.PodSelectorMatches)
	// Use equivalence class to speed up predicates & priorities
	factory.RegisterGetEquivalencePodFunction(GetEquivalencePod)
	// ServiceSpreadingPriority is a priority config factory that spreads pods by minimizing
	// the number of pods (belonging to the same service) on the same node.
	// Register the factory so that it's available, but do not include it as part of the default priorities
	// Largely replaced by "SelectorSpreadPriority", but registered for backward compatibility with 1.0
	factory.RegisterPriorityConfigFactory(
		"ServiceSpreadingPriority",
		factory.PriorityConfigFactory{
			Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
				return priorities.NewSelectorSpreadPriority(args.ServiceLister, algorithm.EmptyControllerLister{}, algorithm.EmptyReplicaSetLister{})
			},
			Weight: 1,
		},
	)
	// EqualPriority is a prioritizer function that gives an equal weight of one to all nodes
	// Register the priority function so that its available
	// but do not include it as part of the default priorities
	factory.RegisterPriorityFunction2("EqualPriority", scheduler.EqualPriorityMap, nil, 1)
	// ImageLocalityPriority prioritizes nodes based on locality of images requested by a pod. Nodes with larger size
	// of already-installed packages required by the pod will be preferred over nodes with no already-installed
	// packages required by the pod or a small total size of already-installed packages required by the pod.
	factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1)
	// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
	factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
}

init() registers DefaultProvider and ClusterAutoscalerProvider by calling RegisterAlgorithmProvider(), registers predicate functions through RegisterFitPredicate(), and registers priority functions through RegisterPriorityFunction2() or RegisterPriorityConfigFactory().
kube-scheduler registers the following algorithm functions:
Predicates: NoVolumeZoneConflict, MaxEBSVolumeCount, MaxGCEPDVolumeCount, MatchInterPodAffinity, NoDiskConflict, GeneralPredicates, PodToleratesNodeTaints, CheckNodeMemoryPressure, CheckNodeDiskPressure, PodFitsPorts, PodFitsHostPorts, PodFitsResources, HostName, MatchNodeSelector.
Priorities: SelectorSpreadPriority, InterPodAffinityPriority, LeastRequestedPriority, BalancedResourceAllocation, NodePreferAvoidPodsPriority, NodeAffinityPriority, TaintTolerationPriority, ServiceSpreadingPriority, EqualPriority, ImageLocalityPriority, MostRequestedPriority.
What each of these algorithms does will be covered one by one in later posts.
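The ClusterAutoscalerProvider registration in init() builds its priority set with copyAndReplace(), whose body is not shown in this post. The sketch below assumes it does what its name and call site suggest: copy the set and swap one name for another, leaving the original untouched. StringSet is a hypothetical minimal stand-in for sets.String.

```go
package main

import (
	"fmt"
	"sort"
)

// StringSet is a hypothetical minimal stand-in for sets.String.
type StringSet map[string]struct{}

// NewStringSet builds a set from the given items.
func NewStringSet(items ...string) StringSet {
	s := StringSet{}
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// Sorted returns the members in sorted order.
func (s StringSet) Sorted() []string {
	out := make([]string, 0, len(s))
	for k := range s {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// copyAndReplace returns a copy of set in which replaceWhat, if present,
// is swapped for replaceWith. The input set is not modified.
// This is an assumed behavior, inferred from the helper's name.
func copyAndReplace(set StringSet, replaceWhat, replaceWith string) StringSet {
	result := NewStringSet(set.Sorted()...)
	if _, ok := result[replaceWhat]; ok {
		delete(result, replaceWhat)
		result[replaceWith] = struct{}{}
	}
	return result
}

func main() {
	defaults := NewStringSet("LeastRequestedPriority", "BalancedResourceAllocation")
	autoscaler := copyAndReplace(defaults, "LeastRequestedPriority", "MostRequestedPriority")
	fmt.Println(defaults.Sorted())
	fmt.Println(autoscaler.Sorted())
}
```

Copying first matters because both providers start from the same default names; mutating the shared set would change DefaultProvider as well.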
First, look at this line:

factory.RegisterAlgorithmProvider(factory.DefaultProvider, defaultPredicates(), defaultPriorities())

This line registers DefaultProvider. Now look at defaultPredicates():

func defaultPredicates() sets.String {
	return sets.NewString(
		// Fit is determined by volume zone requirements.
		//***Register NoVolumeZoneConflict and return the string "NoVolumeZoneConflict"***//
		factory.RegisterFitPredicateFactory(
			"NoVolumeZoneConflict",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewVolumeZonePredicate(args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by whether or not there would be too many AWS EBS volumes attached to the node
		//***Register MaxEBSVolumeCount and return the string "MaxEBSVolumeCount"***//
		factory.RegisterFitPredicateFactory(
			"MaxEBSVolumeCount",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
				maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
				return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by whether or not there would be too many GCE PD volumes attached to the node
		//***Register MaxGCEPDVolumeCount and return the string "MaxGCEPDVolumeCount"***//
		factory.RegisterFitPredicateFactory(
			"MaxGCEPDVolumeCount",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
				maxVols := getMaxVols(DefaultMaxGCEPDVolumes)
				return predicates.NewMaxPDVolumeCountPredicate(predicates.GCEPDVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by inter-pod affinity.
		//***Register MatchInterPodAffinity and return the string "MatchInterPodAffinity"***//
		factory.RegisterFitPredicateFactory(
			"MatchInterPodAffinity",
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister, args.FailureDomains)
			},
		),
		// Fit is determined by non-conflicting disk volumes.
		//***Register NoDiskConflict and return the string "NoDiskConflict"***//
		factory.RegisterFitPredicate("NoDiskConflict", predicates.NoDiskConflict),
		// GeneralPredicates are the predicates that are enforced by all Kubernetes components
		// (e.g. kubelet and all schedulers)
		//***Register GeneralPredicates and return the string "GeneralPredicates"***//
		factory.RegisterFitPredicate("GeneralPredicates", predicates.GeneralPredicates),
		// Fit is determined based on whether a pod can tolerate all of the node's taints
		//***Register PodToleratesNodeTaints and return the string "PodToleratesNodeTaints"***//
		factory.RegisterFitPredicate("PodToleratesNodeTaints", predicates.PodToleratesNodeTaints),
		// Fit is determined by node memory pressure condition.
		//***Register CheckNodeMemoryPressure and return the string "CheckNodeMemoryPressure"***//
		factory.RegisterFitPredicate("CheckNodeMemoryPressure", predicates.CheckNodeMemoryPressurePredicate),
		// Fit is determined by node disk pressure condition.
		//***Register CheckNodeDiskPressure and return the string "CheckNodeDiskPressure"***//
		factory.RegisterFitPredicate("CheckNodeDiskPressure", predicates.CheckNodeDiskPressurePredicate),
	)
}

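As shown, defaultPredicates() registers the predicate functions by calling RegisterFitPredicateFactory() or RegisterFitPredicate(). Each of those calls returns the name it registered, so defaultPredicates() returns the set of names of the registered functions.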
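The "register and return the name" idiom can be sketched in isolation. The registry map, register(), and defaultNames() below are hypothetical names made up for this illustration; the point is only that a single expression both fills the registry (a side effect) and yields the name to collect.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for fitPredicateMap.
var registry = map[string]func() bool{}

// register stores fn under name and returns the name, so the call can be
// used directly as an element of a list of names.
func register(name string, fn func() bool) string {
	registry[name] = fn
	return name
}

// defaultNames registers two placeholder predicates and returns their names
// in sorted order.
func defaultNames() []string {
	names := []string{
		register("NoDiskConflict", func() bool { return true }),
		register("GeneralPredicates", func() bool { return true }),
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(len(registry)) // nothing is registered until defaultNames runs
	names := defaultNames()
	fmt.Println(names, len(registry))
}
```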

Next, look at defaultPriorities():

func defaultPriorities() sets.String {
	return sets.NewString(
		// spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node.
		//***Register SelectorSpreadPriority and return the string "SelectorSpreadPriority"***//
		factory.RegisterPriorityConfigFactory(
			"SelectorSpreadPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister)
				},
				Weight: 1,
			},
		),
		// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
		// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
		//***Register InterPodAffinityPriority and return the string "InterPodAffinityPriority"***//
		factory.RegisterPriorityConfigFactory(
			"InterPodAffinityPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight, args.FailureDomains)
				},
				Weight: 1,
			},
		),
		// Prioritize nodes by least requested utilization.
		//***Register LeastRequestedPriority and return the string "LeastRequestedPriority"***//
		factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),
		// Prioritizes nodes to help achieve balanced resource usage
		//***Register BalancedResourceAllocation and return the string "BalancedResourceAllocation"***//
		factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),
		// Set this weight large enough to override all other priority functions.
		// TODO: Figure out a better way to do this, maybe at same time as fixing #24720.
		//***Register NodePreferAvoidPodsPriority and return the string "NodePreferAvoidPodsPriority"***//
		factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),
		// Prioritizes nodes that have labels matching NodeAffinity
		//***Register NodeAffinityPriority and return the string "NodeAffinityPriority"***//
		factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
		// TODO: explain what it does.
		//***Register TaintTolerationPriority and return the string "TaintTolerationPriority"***//
		factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),
	)
}

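defaultPriorities() registers the priority functions through RegisterPriorityConfigFactory() or RegisterPriorityFunction2(). Like defaultPredicates(), it returns the set of names of the registered functions. Each priority is also registered with a weight: most use 1, while NodePreferAvoidPodsPriority uses 10000 so that it can override all the others.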
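To see why a weight of 10000 lets one priority dominate, here is a minimal sketch that combines per-node scores as a weighted sum. It rests on two assumptions not stated in this post: that the final node score is the sum of weight times score over all priorities, and that individual scores fall in a small range such as 0 to 10. The type and the score tables are hypothetical.

```go
package main

import "fmt"

// weightedPriority is a hypothetical pairing of a priority's weight with
// its precomputed score for each node (assumed to be in the range 0 to 10).
type weightedPriority struct {
	name   string
	weight int
	scores map[string]int
}

// totalScore sums weight*score over all priorities for one node.
func totalScore(node string, prios []weightedPriority) int {
	total := 0
	for _, p := range prios {
		total += p.weight * p.scores[node]
	}
	return total
}

// examplePriorities returns two priorities that disagree about the best node.
func examplePriorities() []weightedPriority {
	return []weightedPriority{
		{name: "LeastRequestedPriority", weight: 1, scores: map[string]int{"node-a": 10, "node-b": 2}},
		{name: "NodePreferAvoidPodsPriority", weight: 10000, scores: map[string]int{"node-a": 0, "node-b": 10}},
	}
}

func main() {
	prios := examplePriorities()
	for _, node := range []string{"node-a", "node-b"} {
		fmt.Println(node, totalScore(node, prios))
	}
}
```

Under these assumptions node-a totals 10 and node-b totals 100002, so the heavily weighted priority decides the outcome regardless of what the weight-1 priorities prefer.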

Registration management

Algorithm registration is managed through three package-level maps, defined in /plugin/pkg/scheduler/factory/plugins.go:

//***The three global variables: algorithmProviderMap, fitPredicateMap and priorityFunctionMap***//
//***algorithmProviderMap maps a provider name to the names of its fit predicates and priority functions***//
fitPredicateMap = make(map[string]FitPredicateFactory)
priorityFunctionMap = make(map[string]PriorityConfigFactory)
algorithmProviderMap = make(map[string]AlgorithmProviderConfig)

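fitPredicateMap maps an algorithm name to the factory (FitPredicateFactory) that produces the predicate function; priorityFunctionMap maps an algorithm name to the PriorityConfigFactory for the priority function; algorithmProviderMap maps a provider name to an AlgorithmProviderConfig.
AlgorithmProviderConfig is defined as follows: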

//***AlgorithmProviderConfig holds FitPredicateKeys and PriorityFunctionKeys***//
type AlgorithmProviderConfig struct {
	FitPredicateKeys     sets.String
	PriorityFunctionKeys sets.String
}

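As shown, AlgorithmProviderConfig contains FitPredicateKeys and PriorityFunctionKeys, which record the names of the algorithms that belong to a given provider name. This explains how the three global variables relate: algorithmProviderMap maps a provider name to its sets of algorithm names, and each algorithm name can then be looked up in fitPredicateMap or priorityFunctionMap to find the corresponding algorithm function.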
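The two-step lookup can be sketched as below. This is a simplified illustration with hypothetical types and a hypothetical resolve() helper: the maps here store functions directly rather than factories, and plain slices replace sets.String.

```go
package main

import "fmt"

type predicate func(node string) bool
type priority func(node string) int

// providerConfig is a simplified stand-in for AlgorithmProviderConfig.
type providerConfig struct {
	predicateKeys []string
	priorityKeys  []string
}

// Simplified stand-ins for the three global maps.
var (
	predicateMap = map[string]predicate{
		"GeneralPredicates": func(string) bool { return true },
		"NoDiskConflict":    func(string) bool { return true },
	}
	priorityMap = map[string]priority{
		"EqualPriority": func(string) int { return 1 },
	}
	providerMap = map[string]providerConfig{
		"DefaultProvider": {
			predicateKeys: []string{"GeneralPredicates", "NoDiskConflict"},
			priorityKeys:  []string{"EqualPriority"},
		},
	}
)

// resolve looks up a provider by name, then resolves each of its algorithm
// names to the registered function. It fails on any unknown name.
func resolve(providerName string) ([]predicate, []priority, error) {
	cfg, ok := providerMap[providerName]
	if !ok {
		return nil, nil, fmt.Errorf("provider %q is not registered", providerName)
	}
	var preds []predicate
	for _, key := range cfg.predicateKeys {
		p, ok := predicateMap[key]
		if !ok {
			return nil, nil, fmt.Errorf("predicate %q is not registered", key)
		}
		preds = append(preds, p)
	}
	var prios []priority
	for _, key := range cfg.priorityKeys {
		p, ok := priorityMap[key]
		if !ok {
			return nil, nil, fmt.Errorf("priority %q is not registered", key)
		}
		prios = append(prios, p)
	}
	return preds, prios, nil
}

func main() {
	preds, prios, err := resolve("DefaultProvider")
	fmt.Println(len(preds), len(prios), err)
}
```

Keeping only names in the provider config means several providers can share one registered function without duplicating it.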
Let's first see how entries are registered in fitPredicateMap. The relevant functions are defined in /plugin/pkg/scheduler/factory/plugins.go:

// RegisterFitPredicate registers a fit predicate with the algorithm
// registry. Returns the name with which the predicate was registered.
func RegisterFitPredicate(name string, predicate algorithm.FitPredicate) string {
	return RegisterFitPredicateFactory(name, func(PluginFactoryArgs) algorithm.FitPredicate { return predicate })
}

// RegisterFitPredicateFactory registers a fit predicate factory with the
// algorithm registry. Returns the name with which the predicate was registered.
//***Register predicateFactory in fitPredicateMap and return name***//
func RegisterFitPredicateFactory(name string, predicateFactory FitPredicateFactory) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	validateAlgorithmNameOrDie(name)
	fitPredicateMap[name] = predicateFactory
	return name
}

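Either RegisterFitPredicate() or RegisterFitPredicateFactory() can be used to register a predicate in fitPredicateMap. RegisterFitPredicate() is a convenience wrapper: it wraps a ready-made predicate in a factory that ignores its arguments and passes it to RegisterFitPredicateFactory().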
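The difference between the two entry points is easier to see in a stripped-down sketch. All names here (factoryArgs, fitPredicate, registerFactory, registerPredicate, registerExamples) are hypothetical simplifications: a factory receives arguments that are only known later and builds the predicate from them, while a plain predicate needs no arguments and is wrapped in a trivial factory.

```go
package main

import "fmt"

// factoryArgs is a hypothetical stand-in for PluginFactoryArgs.
type factoryArgs struct {
	MaxVolumes int
}

// fitPredicate reports whether a node with the given number of attached
// volumes can accept one more.
type fitPredicate func(attachedVolumes int) bool

// fitPredicateFactory builds a predicate once the arguments are available.
type fitPredicateFactory func(args factoryArgs) fitPredicate

var fitPredicates = map[string]fitPredicateFactory{}

// registerFactory stores a factory under name and returns the name.
func registerFactory(name string, f fitPredicateFactory) string {
	fitPredicates[name] = f
	return name
}

// registerPredicate wraps a ready-made predicate in a factory that ignores
// its arguments, then registers that factory.
func registerPredicate(name string, p fitPredicate) string {
	return registerFactory(name, func(factoryArgs) fitPredicate { return p })
}

// registerExamples registers one predicate of each kind.
func registerExamples() {
	registerPredicate("AlwaysFits", func(int) bool { return true })
	registerFactory("MaxVolumes", func(args factoryArgs) fitPredicate {
		return func(attachedVolumes int) bool { return attachedVolumes < args.MaxVolumes }
	})
}

func main() {
	registerExamples()
	args := factoryArgs{MaxVolumes: 3}
	for _, name := range []string{"AlwaysFits", "MaxVolumes"} {
		p := fitPredicates[name](args)
		fmt.Println(name, p(2), p(3))
	}
}
```

Storing factories rather than predicates lets registration happen in init(), before the listers and other runtime arguments exist.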

Next, let's see how entries are registered in priorityFunctionMap. The relevant functions are defined in /plugin/pkg/scheduler/factory/plugins.go:

// Registers a priority function with the algorithm registry. Returns the name,
// with which the function was registered.
// FIXME: Rename to PriorityFunctionFactory.
func RegisterPriorityFunction2(
	name string,
	mapFunction algorithm.PriorityMapFunction,
	reduceFunction algorithm.PriorityReduceFunction,
	weight int) string {
	return RegisterPriorityConfigFactory(name, PriorityConfigFactory{
		MapReduceFunction: func(PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
			return mapFunction, reduceFunction
		},
		Weight: weight,
	})
}

//***Register the priority function in priorityFunctionMap and return name***//
func RegisterPriorityConfigFactory(name string, pcf PriorityConfigFactory) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	validateAlgorithmNameOrDie(name)
	priorityFunctionMap[name] = pcf
	return name
}

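Either RegisterPriorityFunction2() or RegisterPriorityConfigFactory() can be used to register a priority in priorityFunctionMap. RegisterPriorityFunction2() takes a map function and an optional reduce function (nil when none is needed) and wraps them in a PriorityConfigFactory.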
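The map/reduce split can be illustrated with a small sketch. The post does not show what the scheduler's real map and reduce functions compute, so everything below is an assumption for illustration: the map step scores each node independently, and the optional reduce step sees all scores at once, here rescaling them so the highest becomes 10. The names hostScore, runPriority, countLabels, and normalizeTo10 are hypothetical.

```go
package main

import "fmt"

// hostScore is a hypothetical per-node result.
type hostScore struct {
	Host  string
	Score int
}

type mapFn func(node string) int
type reduceFn func(scores []hostScore)

// runPriority applies the map function to every node and then, if a reduce
// function is supplied, lets it adjust the whole result in place.
func runPriority(nodes []string, m mapFn, r reduceFn) []hostScore {
	scores := make([]hostScore, len(nodes))
	for i, n := range nodes {
		scores[i] = hostScore{Host: n, Score: m(n)}
	}
	if r != nil {
		r(scores)
	}
	return scores
}

// labelMatches is made-up input data: matching labels per node.
var labelMatches = map[string]int{"node-a": 4, "node-b": 2, "node-c": 0}

// countLabels is a hypothetical map function: the raw count for one node.
func countLabels(node string) int { return labelMatches[node] }

// normalizeTo10 is a hypothetical reduce function: it rescales the scores
// so that the highest raw score becomes 10.
func normalizeTo10(scores []hostScore) {
	highest := 0
	for _, s := range scores {
		if s.Score > highest {
			highest = s.Score
		}
	}
	if highest == 0 {
		return
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * 10 / highest
	}
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	fmt.Println(runPriority(nodes, countLabels, nil))
	fmt.Println(runPriority(nodes, countLabels, normalizeTo10))
}
```

A reduce step is only needed when a node's final score depends on the other nodes' scores, which is consistent with most registrations above passing nil.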

Finally, let's see how entries are registered in algorithmProviderMap. The relevant function is defined in /plugin/pkg/scheduler/factory/plugins.go:

//***Register the AlgorithmProvider in algorithmProviderMap***//
func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.String) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	validateAlgorithmNameOrDie(name)
	algorithmProviderMap[name] = AlgorithmProviderConfig{
		FitPredicateKeys:     predicateKeys,
		PriorityFunctionKeys: priorityKeys,
	}
	return name
}
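All three registration functions take schedulerFactoryMutex before touching their map. The sketch below shows why that matters, using a hypothetical registry with hypothetical registerProvider() and getProvider() helpers: Go maps are not safe for concurrent writes, so a shared lock makes registration safe from any goroutine. Name validation is omitted here because the rules applied by validateAlgorithmNameOrDie() are not shown in this post.

```go
package main

import (
	"fmt"
	"sync"
)

// providerConfig is a simplified stand-in for AlgorithmProviderConfig.
type providerConfig struct {
	predicateKeys []string
	priorityKeys  []string
}

var (
	mu        sync.Mutex // guards providers
	providers = map[string]providerConfig{}
)

// registerProvider stores the config under name while holding the lock
// and returns the name.
func registerProvider(name string, predicateKeys, priorityKeys []string) string {
	mu.Lock()
	defer mu.Unlock()
	providers[name] = providerConfig{predicateKeys: predicateKeys, priorityKeys: priorityKeys}
	return name
}

// getProvider is a hypothetical lookup counterpart; it takes the same lock.
func getProvider(name string) (providerConfig, error) {
	mu.Lock()
	defer mu.Unlock()
	cfg, ok := providers[name]
	if !ok {
		return providerConfig{}, fmt.Errorf("provider %q is not registered", name)
	}
	return cfg, nil
}

func main() {
	// Register from many goroutines at once; the mutex serializes map writes.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			registerProvider(fmt.Sprintf("provider-%d", i), []string{"GeneralPredicates"}, []string{"EqualPriority"})
		}(i)
	}
	wg.Wait()

	cfg, err := getProvider("provider-7")
	fmt.Println(cfg.predicateKeys, cfg.priorityKeys, err)
}
```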

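Summary

During initialization, kube-scheduler uses three package-level variables, fitPredicateMap, priorityFunctionMap, and algorithmProviderMap, to manage its algorithm functions. Every algorithm function it needs is registered by category in fitPredicateMap or priorityFunctionMap, and algorithmProviderMap records which algorithm names each provider uses.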