It turns out the forum does not support Markdown syntax when posting, and I did not bother to adjust the formatting.
### object tree
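irqbalance collects the interrupt statistics of every CPU over one cycle and then redistributes the interrupts so that the load is balanced across CPUs. One of its key concepts is the **"object tree"**.
>The `object tree` is a hierarchy built from the system topology. It divides the system top-down into four levels, NUMA node, package, cache and cpu: `node->package->cache->cpu`
Take a 16-core dual-socket server as an example; the system has two NUMA nodes (figure taken from the web):
![SMP topology](https://imgconvert.csdnimg.cn/aH ... rocess=image/format,png)
The corresponding object tree topology is shown below:
![Topology](https://imgconvert.csdnimg.cn/aH ... rocess=image/format,png)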
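To make the four levels concrete, here is a minimal sketch of such a tree in C. The type `struct tree_obj` and the helpers `attach`, `ancestor_of_type` and `levels_to_root` are hypothetical names made up for this illustration; they are not irqbalance's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a topology object. */
enum obj_type { OBJ_NODE, OBJ_PACKAGE, OBJ_CACHE, OBJ_CPU };

struct tree_obj {
    enum obj_type type;
    int number;               /* index of the object within its level */
    struct tree_obj *parent;  /* NULL for a NUMA node, the top level */
};

/* Link a child to its parent; a parent must sit exactly one level higher. */
static void attach(struct tree_obj *child, struct tree_obj *parent)
{
    assert(child && parent);
    assert((int)child->type == (int)parent->type + 1);
    child->parent = parent;
}

/* Follow the parent links until an object of the requested type is found. */
static struct tree_obj *ancestor_of_type(struct tree_obj *obj, enum obj_type type)
{
    while (obj && obj->type != type)
        obj = obj->parent;
    return obj;
}

/* Count the levels from obj up to and including the top-level node. */
static int levels_to_root(const struct tree_obj *obj)
{
    int levels = 0;

    for (; obj; obj = obj->parent)
        levels++;
    return levels;
}
```

Starting from a cpu object, `levels_to_root` returns 4 once the chain node->package->cache->cpu has been linked with `attach`.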
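### Topology (build_object_tree)
* Creates and initializes the topology structure and the interrupt data structures.
* Walks /sys/devices/system/node/node to decide how many OBJ_TYPE_NODE objects there are.
* Walks /sys/devices/system/cpu/cpu, checking whether each cpu is online, to decide how many OBJ_TYPE_CPU objects there are.
* Walks /sys/devices/system/cpu/cpu/cache/index[MAX]/shared_cpu_map to decide how many OBJ_TYPE_CACHE objects there are.
* Walks /topology/physical_package_id to decide how many OBJ_TYPE_PACKAGE objects there are.
* Walks the irq and msi entries under /sys/bus/pci/devices/0000:00:[**]./ to build the data for each irq, so that irqbalance knows which node the irq belongs to and what its smp_affinity is.
* Code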
```
void build_object_tree(void)
{
    build_numa_node_list(); // walk /sys/devices/system/node/node and count the OBJ_TYPE_NODE objects
    parse_cpu_tree(); // walk /sys/devices/system/cpu/cpu and count the OBJ_TYPE_CPU objects, etc.
    rebuild_irq_db(); // build the irq database
}
```
So the initial smp_affinity value that irqbalance uses is the device's local_cpus (0-9,40-49 in this example).
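A CPU list written as `0-9,40-49` is the list form of such a CPU set (sysfs exposes the same set as a hex mask in `local_cpus` and, on kernels I am aware of, as a list in `local_cpulist`). The sketch below parses the list form into a 64-bit mask. `parse_cpulist` and `count_cpus` are hypothetical helpers written for this post, and the 64-CPU limit is a simplification of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a CPU list such as "0-9,40-49" into a mask (bit n set means CPU n).
 * Returns 0 on success, -1 on malformed input or a CPU number >= 64.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
    assert(s && mask);
    *mask = 0;
    while (*s && *s != '\n') {
        char *end;
        unsigned long lo = strtoul(s, &end, 10);
        unsigned long hi = lo;

        if (end == s)
            return -1;              /* no digits where a number was expected */
        s = end;
        if (*s == '-') {            /* a range "lo-hi" */
            s++;
            hi = strtoul(s, &end, 10);
            if (end == s)
                return -1;
            s = end;
        }
        if (hi < lo || hi >= 64)
            return -1;
        for (unsigned long cpu = lo; cpu <= hi; cpu++)
            *mask |= UINT64_C(1) << cpu;
        if (*s == ',')
            s++;
        else if (*s && *s != '\n')
            return -1;              /* unexpected separator */
    }
    return 0;
}

/* Count the CPUs in a mask by clearing the lowest set bit each round. */
static int count_cpus(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}
```

For `0-9,40-49` the mask has 20 bits set: bits 0 to 9 and bits 40 to 49.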
### Interrupt classes
irqbalance uses the class code in the PCI configuration space of the owning device to divide interrupts into the following 8 classes:
```
char *classes[] = {
    "other",
    "legacy",
    "storage",
    "video",
    "ethernet",
    "gbit-ethernet",
    "10gbit-ethernet",
    "virt-event",
    0
};
```
Concretely, the code reads the device's class code and shifts it right by 16 bits.
For example, a gigabit NIC:
```
01:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
```
```
[root@node-2 0000:00:01.0]# cat /sys/bus/pci/devices/0000:00:01.0/class
0x060400
```
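The shift can be checked by hand. Note that the `class` file shown above belongs to 0000:00:01.0, not to the I350 port in the lspci line: 0x060400 >> 16 is 0x06, whereas a device that lspci reports as [0200] has the class value 0x020000, which gives 0x02. A minimal sketch, with `pci_base_class` being a hypothetical helper name:

```c
#include <assert.h>
#include <stdlib.h>

/* Convert the text of a sysfs "class" file (e.g. "0x060400") to the PCI base class. */
static unsigned int pci_base_class(const char *class_text)
{
    assert(class_text);
    /* Base 16 makes strtoul accept the optional "0x" prefix. */
    unsigned long code = strtoul(class_text, NULL, 16);

    return (unsigned int)(code >> 16);
}
```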
### Interrupt level (irq_info.level)
Each interrupt class maps to one fixed balancing level. The levels are:
```
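#define BALANCE_NONE 0 // the interrupt must not be migrated
#define BALANCE_PACKAGE 1 // the interrupt can only be balanced at the package level
#define BALANCE_CACHE 2 // the interrupt can only be balanced at the cache level
#define BALANCE_CORE 3 // the interrupt can only be balanced at the core level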
```
The mapping is as follows:
![](https://imgconvert.csdnimg.cn/aH ... rocess=image/format,png)
The code defines it like this:
static int map_class_to_level[8] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };
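Read together with the `classes` array, the table says that "other" is balanced at package level, "legacy" at cache level and everything else at core level. The lookup can be sketched as below; the two arrays are copied from this post, while `level_for_class_name` is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BALANCE_NONE    0
#define BALANCE_PACKAGE 1
#define BALANCE_CACHE   2
#define BALANCE_CORE    3

static const char *classes[] = {
    "other", "legacy", "storage", "video",
    "ethernet", "gbit-ethernet", "10gbit-ethernet", "virt-event",
    NULL
};

static const int map_class_to_level[8] = {
    BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CORE, BALANCE_CORE,
    BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE
};

/* Return the balancing level for a class name, or -1 if the name is unknown. */
static int level_for_class_name(const char *name)
{
    assert(name);
    for (int i = 0; classes[i]; i++)
        if (strcmp(classes[i], name) == 0)
            return map_class_to_level[i];
    return -1;
}
```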
### Interrupt counting (parse_proc_interrupts)
Sums up how many times each interrupt fired across all cpus, mainly by parsing the data in /proc/interrupts. The relevant code:
```
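void parse_proc_interrupts(void)
{
    ...
    /* The number of CPU domains counted here does not match the number in the CPU domain list, so the domain topology has to be rebuilt */
    if (cpunr != core_count) {
        need_rescan = 1;
        break;
    }
    /* This happens when an interrupt is removed and inserted again, so the domain topology has to be rebuilt */
    if (count < info->irq_count) {
        need_rescan = 1;
        break;
    }
    info->last_irq_count = info->irq_count;
    info->irq_count = count;
    ...
}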
```
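The elided part of the function is not reproduced here. As a rough sketch of what parsing one row involves, the helper below sums the per-CPU columns of a numbered /proc/interrupts row and reports how many columns it saw; that column count is the kind of value compared against `core_count` above. `sum_interrupt_line` is a hypothetical name and the parsing rules are my own simplification, not irqbalance's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse one row such as " 24:  100  200  IR-PCI-MSI 1048576-edge  eth0".
 * On success stores the irq number, the sum of the per-CPU counts and the
 * number of count columns, and returns 0. Rows without a numeric irq
 * ("NMI:", "LOC:", ...) return -1.
 */
static int sum_interrupt_line(const char *line, int *irq,
                              unsigned long long *total, int *ncpus)
{
    char *end;

    assert(line && irq && total && ncpus);
    while (isspace((unsigned char)*line))
        line++;
    if (!isdigit((unsigned char)*line))
        return -1;
    *irq = (int)strtol(line, &end, 10);
    if (*end != ':')
        return -1;
    line = end + 1;

    *total = 0;
    *ncpus = 0;
    for (;;) {
        unsigned long long n = strtoull(line, &end, 10);

        if (end == line)
            break;                  /* next token is not a number */
        if (*end && !isspace((unsigned char)*end))
            break;                  /* token like "2-edge" is not a count */
        *total += n;
        (*ncpus)++;
        line = end;
    }
    return 0;
}
```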
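### Interrupt load (parse_proc_stat)
The interrupt load is the time that interrupts spend on a cpu during one cycle (10s). It is obtained mainly by parsing /proc/stat:
![Interrupt load]()
* How the load of a cpu node is computed:
In the figure above, the 4th and 5th columns from the end (inside the red box) are the total interrupt times on each cpu, in jiffies; the code calls them irq_load and softirq_load
cpu->load = (irq_load + softirq_load) - (cpu->last_load);
* The load of a parent node is the sum of the loads of its children
* Compute local_count, the average number of interrupts per cycle on each object.
The code is as follows: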
```
void parse_proc_stat(void)
{
    ...
    /*
     * For each cpu add the irq and softirq load and propagate that
     * all the way up the device tree
     */
    if (cycle_count) {
        cpu->load = (irq_load + softirq_load) - (cpu->last_load);
        /*
         * the [soft]irq_load values are in jiffies, with
         * HZ jiffies per second. Convert the load to nanoseconds
         * to get a better integer resolution of nanoseconds per
         * interrupt.
         */
        cpu->load *= NSEC_PER_SEC/HZ;
    }
    cpu->last_load = (irq_load + softirq_load);
    ...
    for_each_object(cpus, compute_irq_branch_load_share, NULL); // compute the load of every interrupt on each cpu
    for_each_object(cache_domains, compute_irq_branch_load_share, NULL);
    for_each_object(packages, compute_irq_branch_load_share, NULL);
    for_each_object(numa_nodes, compute_irq_branch_load_share, NULL);
}
```
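The per-cpu arithmetic above can be tried on a single /proc/stat row. The sketch below assumes the usual column order `cpuN user nice system idle iowait irq softirq ...`; `parse_cpu_irq_time` and `irq_load_ns` are hypothetical helpers, and the tick rate is passed in as the parameter `hz` (commonly 100) rather than taken from any particular macro.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Extract the irq and softirq columns from a per-cpu row of /proc/stat.
 * Returns 0 on success, -1 for the aggregate "cpu " row or any other row.
 */
static int parse_cpu_irq_time(const char *line, int *cpu,
                              unsigned long long *irq,
                              unsigned long long *softirq)
{
    assert(line && cpu && irq && softirq);
    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return -1;
    /* %*s skips user, nice, system, idle and iowait. */
    int n = sscanf(line, "cpu%d %*s %*s %*s %*s %*s %llu %llu",
                   cpu, irq, softirq);

    return n == 3 ? 0 : -1;
}

/* Turn the growth of (irq + softirq) between two samples into nanoseconds. */
static unsigned long long irq_load_ns(unsigned long long prev_jiffies,
                                      unsigned long long cur_jiffies,
                                      unsigned long hz)
{
    assert(hz > 0 && cur_jiffies >= prev_jiffies);
    return (cur_jiffies - prev_jiffies) * (NSEC_PER_SEC / hz);
}
```

With `hz` equal to 100, a growth of 3 jiffies between two samples corresponds to a load of 30,000,000 ns.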
compute_irq_branch_load_share
```
static void compute_irq_branch_load_share(struct topo_obj *d, void *data __attribute__((unused)))
{
    uint64_t local_irq_counts = 0;
    uint64_t load_slice;
    int load_divisor = g_list_length(d->children);
    d->load /= (load_divisor ? load_divisor : 1);
    if (g_list_length(d->interrupts) > 0)
    {
        local_irq_counts = get_parent_branch_irq_count_share(d); // average interrupt count of this node
        load_slice = local_irq_counts ? (d->load / local_irq_counts) : 1; // average time slice per interrupt
        for_each_irq(d->interrupts, assign_load_slice, &load_slice); // compute the load of each interrupt
    }
    if (d->parent)
        d->parent->load += d->load;
}
```
Computing the average interrupt count per unit time
```
static uint64_t get_parent_branch_irq_count_share(struct topo_obj *d)
{
    uint64_t total_irq_count = 0;
    if (d->parent) {
        total_irq_count = get_parent_branch_irq_count_share(d->parent);
        total_irq_count /= g_list_length((d->parent)->children);
    }
    if (g_list_length(d->interrupts) > 0)
        for_each_irq(d->interrupts, accumulate_irq_count, &total_irq_count);
    return total_irq_count;
}
```
Computing the load of each interrupt within the different domains per unit time (10s)
```
static void assign_load_slice(struct irq_info *info, void *data)
{
    uint64_t *load_slice = data;
    info->load = (info->irq_count - info->last_irq_count) * *load_slice;
    if (!info->load)
        info->load++;
}
```
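A small worked example may help here. The sketch below handles a single object without a parent, so the share inherited from the parent branch is left out, and it assumes that the interrupt count being accumulated is the growth `irq_count - last_irq_count` of each interrupt. `struct irq_sample` and `share_load` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of the per-interrupt bookkeeping. */
struct irq_sample {
    uint64_t irq_count;       /* count read in this cycle */
    uint64_t last_irq_count;  /* count read in the previous cycle */
    uint64_t load;            /* result: share of the object's load */
};

/* Split obj_load over the interrupts in proportion to how often each fired. */
static void share_load(uint64_t obj_load, struct irq_sample *irqs, size_t n)
{
    uint64_t total = 0;

    assert(irqs || n == 0);
    for (size_t i = 0; i < n; i++)
        total += irqs[i].irq_count - irqs[i].last_irq_count;

    /* Load attributed to one single interrupt occurrence. */
    uint64_t load_slice = total ? obj_load / total : 1;

    for (size_t i = 0; i < n; i++) {
        irqs[i].load = (irqs[i].irq_count - irqs[i].last_irq_count) * load_slice;
        if (!irqs[i].load)
            irqs[i].load++;   /* an interrupt never gets a load of zero */
    }
}
```

With an object load of 1000 and three interrupts that fired 30, 70 and 0 times, the slice is 10 and the loads become 300, 700 and 1.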
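### Updating the migration status (update_migration_status)
* Walk the objects of each level and compute the level's average load info->avg_load and its standard deviation info->std_deviation
* The standard deviation expresses how widely the interrupt load is spread. The formula is:
$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(load_i - \overline{load}\right)^2}$$
Here $N$ is info->load_sources and $\overline{load}$ is info->avg_load. This matches the code below provided that info->deviations holds the sum of the squared differences from the mean.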
```
static void find_overloaded_objs(GList *name, struct load_balance_info *info)
{
    memset(info, 0, sizeof(struct load_balance_info));
    for_each_object(name, gather_load_stats, info);
    info->load_sources = (info->load_sources == 0) ? 1 : (info->load_sources);
    info->avg_load = info->total_load / info->load_sources; // mean load
    for_each_object(name, compute_deviations, info);
    /* Don't divide by zero if there is a single load source */
    if (info->load_sources == 1)
        info->std_deviation = 0;
    else {
        info->std_deviation = (long double)(info->deviations / (info->load_sources - 1)); // variance
        info->std_deviation = sqrt(info->std_deviation); // standard deviation
    }
    for_each_object(name, migrate_overloaded_irqs, info);
}
```
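The statistics can be reproduced on a plain array of loads. The sketch below assumes that the deviations are the squared differences from the mean, and it also counts how many objects lie at least one standard deviation below or above the mean. `summarize_loads` and `struct load_stats` are hypothetical names; `newton_sqrt` is only a stand-in for `sqrt()` so that the example does not depend on linking the math library.

```c
#include <assert.h>
#include <stddef.h>

struct load_stats {
    double avg_load;
    double std_deviation;
    int num_under;   /* load + std_deviation <= avg_load */
    int num_over;    /* load - std_deviation >= avg_load */
};

/* Square root by Newton iteration, a stand-in for sqrt() from <math.h>. */
static double newton_sqrt(double v)
{
    assert(v >= 0.0);
    if (v == 0.0)
        return 0.0;

    double x = v > 1.0 ? v : 1.0;

    for (int i = 0; i < 200; i++) {
        double next = 0.5 * (x + v / x);

        if (next == x)
            break;            /* converged */
        x = next;
    }
    return x;
}

static struct load_stats summarize_loads(const double *loads, size_t n)
{
    struct load_stats s = { 0.0, 0.0, 0, 0 };
    double total = 0.0, deviations = 0.0;

    assert(loads && n > 0);
    for (size_t i = 0; i < n; i++)
        total += loads[i];
    s.avg_load = total / (double)n;

    for (size_t i = 0; i < n; i++) {
        double d = loads[i] - s.avg_load;

        deviations += d * d;
    }
    /* Sample standard deviation; a single load source has no spread. */
    if (n > 1)
        s.std_deviation = newton_sqrt(deviations / (double)(n - 1));

    for (size_t i = 0; i < n; i++) {
        if (loads[i] + s.std_deviation <= s.avg_load)
            s.num_under++;
        else if (loads[i] - s.std_deviation >= s.avg_load)
            s.num_over++;
    }
    return s;
}
```

For the loads 0, 50, 50 and 100 the mean is 50, the standard deviation is about 40.8, and one object is counted as under and one as over.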
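* Walk the objects and find the nodes whose load is greater than info->min_load and that hold more than one interrupt; sort all interrupts of such a node in ascending order, then remove them one by one and put them on the list named rebalance_irq_list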
```
static void migrate_overloaded_irqs(struct topo_obj *obj, void *data)
{
    ...
    if ((obj->load + info->std_deviation) <= info->avg_load) {
        info->num_under++;
        if (power_thresh != ULONG_MAX && !info->powersave)
            if (!obj->powersave_mode)
                info->powersave = obj;
    } else if ((obj->load - info->std_deviation) >=info->avg_load) {
        info->num_over++;
    }
    if ((obj->load > info->min_load) &&
        (g_list_length(obj->interrupts) > 1)) {
        sort_irq_list(&obj->interrupts);
        info->adjustment_load = obj->load;
        for_each_irq(obj->interrupts, move_candidate_irqs, info);
    }
}
```
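* After an interrupt has been moved off a node, both the node's load and min_load are updated; migration stops once the two values are as close to each other as they can get.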
```
if ((lb_info->adjustment_load - info->load) > (lb_info->min_load + info->load)) {
    lb_info->adjustment_load -= info->load;
    lb_info->min_load += info->load;
} else
    return;
```
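The effect of this condition is easiest to see on numbers. The sketch below applies the same test to an array of interrupt loads in ascending order and records which ones would be moved. It leaves out every other check the real callback may perform, and `struct balance_state` and `pick_irqs_to_move` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct balance_state {
    uint64_t adjustment_load;  /* load still left on the overloaded object */
    uint64_t min_load;         /* smallest object load, plus what was moved */
};

/*
 * Mark the interrupts that pass the test from the post. moved[i] is set to 1
 * for every interrupt taken off the object. Returns how many were moved.
 */
static size_t pick_irqs_to_move(struct balance_state *st,
                                const uint64_t *irq_loads, size_t n, int *moved)
{
    size_t count = 0;

    assert(st && irq_loads && moved);
    for (size_t i = 0; i < n; i++) {
        uint64_t load = irq_loads[i];

        moved[i] = 0;
        /* Sketch-only guard: the subtraction below is unsigned. */
        if (load > st->adjustment_load)
            continue;
        if (st->adjustment_load - load > st->min_load + load) {
            st->adjustment_load -= load;
            st->min_load += load;
            moved[i] = 1;
            count++;
        }
    }
    return count;
}
```

With an object load of 100, a min_load of 20 and interrupt loads 5, 10, 20 and 40, the first three interrupts are moved. The values end at 65 and 55, and the interrupt with load 40 stays because moving it would overshoot.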
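### Computing the placement (calculate_placement)
* First sort the rebalance_irq_list list
* Look up the ID of the NUMA node each interrupt belongs to and move the interrupt onto the interrupt list of that node; interrupts cannot migrate across nodes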
```
migrate_irq(&rebalance_irq_list, &irq_numa_node(info)->interrupts, info);
info->assigned_obj = irq_numa_node(info);
irq_numa_node(info)->load += info->load + 1;
```
* Walk down the domains level by level, find the child with the smallest load at each level and move the interrupt to it
```
for_each_object(numa_nodes, place_irq_in_object, NULL);
for_each_object(packages, place_irq_in_object, NULL);
for_each_object(cache_domains, place_irq_in_object, NULL);
...
for_each_object(d->children, find_best_object, &place);
asign = place.least_irqs ? place.least_irqs : place.best;
if (asign) {
    migrate_irq(&d->interrupts, &asign->interrupts, info);
    info->assigned_obj = asign;
    asign->load += info->load;
}
...
```
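Reduced to its core, each step picks the child with the smallest load and adds the interrupt's load to it. The sketch below works on a plain array of child loads and ignores the `place.least_irqs` preference and any other criteria used by `find_best_object`. `least_loaded` and `place_irq` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the child with the smallest load; the first one wins ties. */
static size_t least_loaded(const uint64_t *child_loads, size_t n)
{
    size_t best = 0;

    assert(child_loads && n > 0);
    for (size_t i = 1; i < n; i++)
        if (child_loads[i] < child_loads[best])
            best = i;
    return best;
}

/* Place one interrupt on the least loaded child and account for its load there. */
static size_t place_irq(uint64_t *child_loads, size_t n, uint64_t irq_load)
{
    size_t best = least_loaded(child_loads, n);

    child_loads[best] += irq_load;
    return best;
}
```

Because the chosen child's load grows immediately, consecutive interrupts spread over the children instead of piling onto one of them.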
* Activate the interrupt mapping
Write the result to /proc/irq/xxx/smp_affinity so that the new placement takes effect
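As an illustration of what ends up in that file, the sketch below formats a CPU mask as the hexadecimal string the kernel accepts for smp_affinity, using comma-separated 32-bit groups for masks wider than 32 bits. It only builds the strings and does not write to /proc. `format_affinity` and `affinity_path` are hypothetical helpers, and the 64-CPU limit is a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a CPU mask as hex text, e.g. 0x3ff -> "3ff". Returns 0 on success. */
static int format_affinity(uint64_t mask, char *buf, size_t len)
{
    assert(buf && len > 0);

    unsigned int hi = (unsigned int)(mask >> 32);
    unsigned int lo = (unsigned int)(mask & 0xffffffffu);
    /* The high group is only printed when it is non-zero. */
    int n = hi ? snprintf(buf, len, "%x,%08x", hi, lo)
               : snprintf(buf, len, "%x", lo);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Build the path of the affinity file for one irq. Returns 0 on success. */
static int affinity_path(int irq, char *buf, size_t len)
{
    assert(buf && len > 0 && irq >= 0);

    int n = snprintf(buf, len, "/proc/irq/%d/smp_affinity", irq);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

For CPUs 0-9 the string is `3ff`; for CPUs 0-9 plus 40-49 it is `3ff00,000003ff`.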
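### Power-save mode (powersave_mode)
The `powersave_mode` of `irqbalance` is off by default, but the user can enable it by setting a `power_thresh` on the command line; irqbalance then switches between power-save mode and normal mode automatically according to the load. The check is `power_thresh != ULONG_MAX`: the default value of power_thresh is ULONG_MAX, so by default this branch is never entered and `info.powersave->powersave_mode` is never set.
Assuming the mode is enabled:
* In `migrate_overloaded_irqs`, an object that satisfies `(obj->load + info->std_deviation) <= info->avg_load` is recorded as the `powersave` candidate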
```
static void migrate_overloaded_irqs(struct topo_obj *obj, void *data)
{
    struct load_balance_info *info = data;
    if (obj->powersave_mode)
        info->num_powersave++;
    if ((obj->load + info->std_deviation) <= info->avg_load) {
        info->num_under++;
        if (power_thresh != ULONG_MAX && !info->powersave)
            if (!obj->powersave_mode)
                info->powersave = obj;
    } else if ((obj->load - info->std_deviation) >=info->avg_load) {
        info->num_over++;
    }
    ...
}
```
* In `update_migration_status`, the interrupts of the node marked as `powersave` are forcibly migrated away so that the cpu can sleep
```
if (power_thresh != ULONG_MAX && cycle_count > 5) {
    if (!info.num_over && (info.num_under >= power_thresh) && info.powersave) {
        log(TO_ALL, LOG_INFO, "cpu %d entering powersave mode", info.powersave->number);
        info.powersave->powersave_mode = 1;
        if (g_list_length(info.powersave->interrupts) > 0)
            for_each_irq(info.powersave->interrupts, force_irq_migration, NULL); // move them onto the rebalance list
    } else if ((info.num_over) && (info.num_powersave)) {
        log(TO_ALL, LOG_INFO, "Load average increasing, re-enabling all cpus for irq balancing");
        for_each_object(cpus, clear_powersave_mode, NULL);
    }
}
```
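The decision itself is a pure function of a few counters, which makes it easy to check in isolation. The sketch below restates the condition from the snippet above; `struct level_summary`, `enum powersave_action` and `powersave_decision` are hypothetical names, and `has_candidate` stands for `info.powersave` being non-NULL.

```c
#include <assert.h>
#include <limits.h>

enum powersave_action { PS_NONE, PS_ENTER, PS_LEAVE };

struct level_summary {
    unsigned long num_over;       /* objects well above the average load */
    unsigned long num_under;      /* objects well below the average load */
    unsigned long num_powersave;  /* objects already in power-save mode */
    int has_candidate;            /* an under-loaded object was remembered */
};

static enum powersave_action powersave_decision(const struct level_summary *s,
                                                unsigned long power_thresh,
                                                unsigned long cycle_count)
{
    assert(s);
    /* Feature disabled (default threshold) or still warming up. */
    if (power_thresh == ULONG_MAX || cycle_count <= 5)
        return PS_NONE;
    if (!s->num_over && s->num_under >= power_thresh && s->has_candidate)
        return PS_ENTER;   /* put the candidate into power-save mode */
    if (s->num_over && s->num_powersave)
        return PS_LEAVE;   /* load is rising: re-enable all cpus */
    return PS_NONE;
}
```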