Discussion:
[BUG] kernel softlockup due to sidtab_search_context running for a long time because of too many sidtab context nodes
yangjihong
2017-12-13 09:25:07 UTC
Permalink
Hello,

I am doing stress testing on the 3.10 kernel (CentOS 7.4), constantly starting a number of Docker containers with SELinux enabled, and after about 2 days the kernel panics with a softlockup:
<IRQ> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
[<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
[<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
[<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
[<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
[<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
[<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
[<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
[<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
<EOI> [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
[<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
[<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
[<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
[<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
[<ffffffff812b1960>] ? sel_write_member+0x200/0x200
[<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
[<ffffffff811f444d>] vfs_write+0xbd/0x1e0
[<ffffffff811f4eef>] SyS_write+0x7f/0xe0
[<ffffffff8166d433>] system_call_fastpath+0x16/0x1b

My opinion:
When a Docker container starts, it mounts an overlay filesystem with a different SELinux context; the mount points look like this:
overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)

sidtab_search_context checks whether the context is already in the sidtab list; if it is not found, a new node is generated and inserted into the list. As the number of containers increases, the number of context nodes also grows. In our test the final number of nodes reached 300,000+, and sidtab_context_to_sid took 100-200 ms per call, which leads to the system softlockup.
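To illustrate the cost, here is a minimal user-space sketch of this reverse lookup (not the real sidtab.c code; the structures and names are simplified): the search is O(n) in the number of nodes, so with 300,000+ nodes every lookup of a never-before-seen context walks the whole list before inserting.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel's context/sidtab structures. */
struct context {
	char *str;	/* e.g. "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873" */
};

struct sidtab_node {
	unsigned int sid;
	struct context ctx;
	struct sidtab_node *next;
};

struct sidtab {
	struct sidtab_node *head;
	unsigned int next_sid;
	unsigned int nel;	/* number of nodes */
};

/* Reverse lookup: walk every node and compare contexts.  If no match is
 * found, allocate a new node and hand out a fresh SID.  With N existing
 * nodes and a brand-new context (the new-container case), this always
 * performs N comparisons before inserting. */
unsigned int context_to_sid(struct sidtab *s, const char *ctx_str)
{
	struct sidtab_node *cur, *node;

	for (cur = s->head; cur; cur = cur->next)
		if (strcmp(cur->ctx.str, ctx_str) == 0)
			return cur->sid;	/* positive case: found */

	node = calloc(1, sizeof(*node));	/* negative case: insert */
	if (!node)
		return 0;
	node->sid = s->next_sid++;
	node->ctx.str = strdup(ctx_str);
	node->next = s->head;
	s->head = node;
	s->nel++;
	return node->sid;
}

int main(void)
{
	struct sidtab s = { .head = NULL, .next_sid = 1, .nel = 0 };

	/* Two containers with different category pairs -> two nodes. */
	printf("%u\n", context_to_sid(&s, "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873"));
	printf("%u\n", context_to_sid(&s, "system_u:object_r:svirt_sandbox_file_t:s0:c431,c651"));
	printf("%u nodes\n", s.nel);
	return 0;
}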

Is this an SELinux bug? When the filesystem is unmounted, why is the context node not deleted? I cannot find a function in sidtab.c that deletes a node.

Thanks for reading and looking forward to your reply.
Stephen Smalley
2017-12-13 15:18:16 UTC
Permalink
So, does docker just keep allocating a unique category set for every
new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded
number of containers and never destroying the older ones?

On the selinux userspace side, we'd also like to eliminate the use of
/sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
entirely, which is what triggered this for you.

We cannot currently delete a sidtab node because we have no way of
knowing if there are any lingering references to the SID. Fixing that
would require reference-counted SIDs, which goes beyond just SELinux
since SIDs/secids are returned by LSM hooks and cached in other kernel
data structures.

sidtab_search_context() could no doubt be optimized for the negative
case; there was an earlier optimization for the positive case by adding
a cache to sidtab_context_to_sid() prior to calling it. It's a reverse
lookup in the sidtab.
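Roughly speaking (this is only a sketch of the idea, reusing the simplified structures from the sketch in the report above, not the actual sidtab.c code), the positive-case optimization checks a small most-recently-used cache before falling back to the full scan; a context that has never been seen still pays the full O(n) walk:

#define SIDTAB_CACHE_LEN 3	/* tiny MRU cache; value chosen for the sketch */

struct sidtab_cached {
	struct sidtab_node *cache[SIDTAB_CACHE_LEN];	/* most recent first */
	struct sidtab base;				/* the list from the earlier sketch */
};

/* Check the MRU cache first; only fall back to the O(n) scan on a miss.
 * This helps repeated lookups of hot contexts (positive case) but does
 * nothing for a context that has never been seen (negative case). */
unsigned int context_to_sid_cached(struct sidtab_cached *s, const char *ctx_str)
{
	int i;

	for (i = 0; i < SIDTAB_CACHE_LEN; i++) {
		struct sidtab_node *n = s->cache[i];

		if (n && strcmp(n->ctx.str, ctx_str) == 0) {
			/* Promote the hit to the front of the cache. */
			memmove(&s->cache[1], &s->cache[0], i * sizeof(s->cache[0]));
			s->cache[0] = n;
			return n->sid;
		}
	}
	return context_to_sid(&s->base, ctx_str);	/* full scan + insert */
}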
yangjihong
2017-12-14 03:19:18 UTC
Permalink
Hello,
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
I create a container, then destroy it, then create a second one and destroy it, and so on.
When a container is created, Docker mounts an overlay fs; because every container has a different SELinux context, a new sidtab node is generated and inserted into the sidtab list.
When the container is destroyed, Docker unmounts the overlay fs, but the umount operation does not appear to reach any hook that deletes the node, resulting in a longer and longer sidtab list.
I think the SELinux context will never be reused after umount, so the sidtab node is useless and it would be best to delete it.
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
I think adding a cache may not be very useful, because every container has a different SELinux context. When a new container is created, the whole sidtab list is searched, all the way to the last node; a new node can only be inserted after all existing nodes have been compared.
As long as nodes are never deleted from the list, the list will keep growing and the search time will get longer and longer, eventually leading to the softlockup.


Is there any solution to this problem?
Thanks for reading and looking forward to your reply.

Best wishes!

Stephen Smalley
2017-12-14 13:07:30 UTC
Permalink
Post by yangjihong
Hello,
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed? That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
I create a container, then destroy it, then create a second one and destroy it, and so on.
When a container is created, Docker mounts an overlay fs; because every container has a different SELinux context, a new sidtab node is generated and inserted into the sidtab list.
When the container is destroyed, Docker unmounts the overlay fs, but the umount operation does not appear to reach any hook that deletes the node, resulting in a longer and longer sidtab list.
I think the SELinux context will never be reused after umount, so the sidtab node is useless and it would be best to delete it.
The "selinux context will never reuse" is IMHO a bug in docker; if you
truly destroy the container (i.e. don't just stop its execution, but
delete it entirely), then the context should be reusable.
Post by yangjihong
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
I think adding a cache may not be very useful, because every container has a different SELinux context. When a new container is created, the whole sidtab list is searched, all the way to the last node; a new node can only be inserted after all existing nodes have been compared.
As long as nodes are never deleted from the list, the list will keep growing and the search time will get longer and longer, eventually leading to the softlockup.
Is there any solution to this problem?
On the kernel side, we could certainly implement a reverse lookup hash
table. And there could be a faster way to quickly check whether a
given category set has ever been used if we wanted to specialize in
that manner. But that won't fix the fact that docker is allocating
unbounded security contexts.
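For illustration, a reverse-lookup hash table along these lines (a sketch only, building on the simplified structures from earlier in the thread; the names and bucket count are made up) would make both the hit and the miss roughly O(1) in the number of contexts, though as noted it would not bound the table's growth:

#define SIDTAB_RHASH_BUCKETS 1024	/* sketch value */

struct sidtab_rnode {
	struct sidtab_node *node;	/* entry owned by the main sidtab */
	struct sidtab_rnode *next;
};

struct sidtab_rhash {
	struct sidtab_rnode *buckets[SIDTAB_RHASH_BUCKETS];
};

/* Any reasonable hash of the context works; this is djb2 over the string. */
static unsigned int context_hash(const char *ctx_str)
{
	unsigned int h = 5381;

	while (*ctx_str)
		h = h * 33 + (unsigned char)*ctx_str++;
	return h % SIDTAB_RHASH_BUCKETS;
}

/* Reverse lookup via the hash table: only one bucket chain is scanned,
 * so the cost no longer grows with the total number of contexts. */
struct sidtab_node *rhash_find(struct sidtab_rhash *rh, const char *ctx_str)
{
	struct sidtab_rnode *r;

	for (r = rh->buckets[context_hash(ctx_str)]; r; r = r->next)
		if (strcmp(r->node->ctx.str, ctx_str) == 0)
			return r->node;
	return NULL;
}

/* Every insertion into the main sidtab must also be mirrored here. */
int rhash_insert(struct sidtab_rhash *rh, struct sidtab_node *n)
{
	unsigned int b = context_hash(n->ctx.str);
	struct sidtab_rnode *r = calloc(1, sizeof(*r));

	if (!r)
		return -1;
	r->node = n;
	r->next = rh->buckets[b];
	rh->buckets[b] = r;
	return 0;
}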
Casey Schaufler
2017-12-14 16:18:07 UTC
Permalink
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every
new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded
number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends
a labeled packet to MachineB. ContainerA goes away and its context
is recycled in ContainerC. MachineB responds some time later, again
with a labeled packet. ContainerC gets information intended for
ContainerA, and uses the information to take over the Elbonian
government.
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of
/sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of
knowing if there are any lingering references to the SID. Fixing that
would require reference-counted SIDs, which goes beyond just SELinux
since SIDs/secids are returned by LSM hooks and cached in other kernel
data structures.
You could delete a sidtab node. The code already deals with unfindable
SIDs. The issue is that eventually you run out of SIDs. Then you are
forced to recycle SIDs, which leads to the overthrow of the Elbonian
government.
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative
case; there was an earlier optimization for the positive case by adding
a cache to sidtab_context_to_sid() prior to calling it. It's a reverse
lookup in the sidtab.
This seems like a bad idea.
Stephen Smalley
2017-12-14 16:42:51 UTC
Permalink
Post by Casey Schaufler
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends
a labeled packet to MachineB. ContainerA goes away and its context
is recycled in ContainerC. MachineB responds some time later, again
with a labeled packet. ContainerC gets information intended for
ContainerA, and uses the information to take over the Elbonian
government.
Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).
Post by Casey Schaufler
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of
/sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.
We don't know when we can safely delete a sidtab node since SIDs aren't
reference counted and we can't know whether it is still in use
somewhere in the kernel. Doing so prematurely would lead to the SID
being remapped to the unlabeled context, and then likely to undesired
denials.
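To be concrete about what reference counting would entail (this is only a sketch of the concept, with made-up names, reusing struct context from the earlier sketch; in the kernel it would presumably be something like a kref), every place that stores a SID would have to take a reference and drop it when done, and only a count of zero would make deletion safe:

/* Sketch only: a sidtab entry with a reference count.  Deletion is safe
 * only when every holder (inode security blobs, socket/peer labels, audit
 * records, ...) has dropped its reference -- which is exactly the
 * bookkeeping the kernel does not do for secids today. */
struct sid_entry {
	unsigned int sid;
	struct context ctx;
	unsigned long refcount;
};

/* Caller is about to store the SID somewhere long-lived. */
void sid_get(struct sid_entry *e)
{
	e->refcount++;
}

/* Caller no longer stores the SID; returns nonzero when the entry may be
 * freed, e.g. on the umount path the original report asks about. */
int sid_put(struct sid_entry *e)
{
	return --e->refcount == 0;
}

Retrofitting that onto every consumer of secids is the part that goes beyond SELinux itself.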
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.
Casey Schaufler
2017-12-14 17:00:46 UTC
Permalink
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends
a labeled packet to MachineB. ContainerA goes away and its context
is recycled in ContainerC. MachineB responds some time later, again
with a labeled packet. ContainerC gets information intended for
ContainerA, and uses the information to take over the Elbonian
government.
Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).
If labeled networking weren't an issue we'd have full security
module stacking by now. Yes, it's an edge case. If you want to
use labeled NFS or a local filesystem that gets mounted in each
container (don't tell me that nobody would do that) you've got
the same problem.
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of
/sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.
We don't know when we can safely delete a sidtab node since SIDs aren't
reference counted and we can't know whether it is still in use
somewhere in the kernel. Doing so prematurely would lead to the SID
being remapped to the unlabeled context, and then likely to undesired
denials.
I would suggest that if you delete a sidtab node and someone
comes along later and tries to use it that denial is exactly
what you would desire. I don't see any other rational action.
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.
Stephen Smalley
2017-12-14 17:15:55 UTC
Permalink
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.
Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).
If labeled networking weren't an issue we'd have full security
module stacking by now. Yes, it's an edge case. If you want to
use labeled NFS or a local filesystem that gets mounted in each
container (don't tell me that nobody would do that) you've got
the same problem.
Even if someone were to configure labeled networking, Docker is not
presently relying on that or SELinux network enforcement for any
security properties, so it really doesn't matter. And if they wanted
to do that, they'd have to coordinate category assignments across all
systems involved, for which no facility exists AFAIK. If you have two
docker instances running on different hosts, I'd wager that they can
hand out the same category sets today to different containers.

With respect to labeled NFS, that's also not the default for nfs
mounts, so again it is a custom configuration and Docker isn't relying
on it for any guarantees today. For local filesystems, they would
normally be context-mounted or using genfscon rather than xattrs in
order to be accessible to the container, thus no persistent storage of
the category sets.

Certainly docker could provide an option to not reuse category sets,
but making that the default is not sane and just guarantees exhaustion
of the SID and context space (just create and tear down lots of
containers every day or more frequently).
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
You could delete a sidtab node. The code already deals with unfindable
SIDs. The issue is that eventually you run out of SIDs. Then you are
forced to recycle SIDs, which leads to the overthrow of the Elbonian
government.
We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.
I would suggest that if you delete a sidtab node and someone
comes along later and tries to use it that denial is exactly
what you would desire. I don't see any other rational action.
Yes, if we know that the SID wasn't in use at the time we tore it down.
But if we're just randomly deleting sidtab entries based on age or
something (since we have no reference count), we'll almost certainly
encounter situations where a SID hasn't been accessed in a long time
but is still being legitimately cached somewhere. Just a file that
hasn't been accessed in a while might have that SID still cached in its
inode security blob, or anywhere else.
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.
Casey Schaufler
2017-12-14 17:42:28 UTC
Permalink
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.
Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).
If labeled networking weren't an issue we'd have full security
module stacking by now. Yes, it's an edge case. If you want to
use labeled NFS or a local filesystem that gets mounted in each
container (don't tell me that nobody would do that) you've got
the same problem.
Even if someone were to configure labeled networking, Docker is not
presently relying on that or SELinux network enforcement for any
security properties, so it really doesn't matter.
True enough. I can imagine a use case, but as you point out, it
would be a very complex configuration and coordination exercise
using SELinux.
Post by Stephen Smalley
And if they wanted
to do that, they'd have to coordinate category assignments across all
systems involved, for which no facility exists AFAIK. If you have two
docker instances running on different hosts, I'd wager that they can
hand out the same category sets today to different containers.
With respect to labeled NFS, that's also not the default for nfs
mounts, so again it is a custom configuration and Docker isn't relying
on it for any guarantees today. For local filesystems, they would
normally be context-mounted or using genfscon rather than xattrs in
order to be accessible to the container, thus no persistent storage of
the category sets.
I know that is the intended configuration, but I see people do
all sorts of stoopid things for what they believe are good reasons.
Unfortunately, lots of people count on containers to provide
isolation, but create "solutions" for data sharing that defeat it.
Post by Stephen Smalley
Certainly docker could provide an option to not reuse category sets,
but making that the default is not sane and just guarantees exhaustion
of the SID and context space (just create and tear down lots of
containers every day or more frequently).
It seems that Docker might have a similar issue with UIDs,
but it takes longer to run out of UIDs than sidtab entries.
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.
We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.
I would suggest that if you delete a sidtab node and someone
comes along later and tries to use it that denial is exactly
what you would desire. I don't see any other rational action.
Yes, if we know that the SID wasn't in use at the time we tore it down.
But if we're just randomly deleting sidtab entries based on age or
something (since we have no reference count), we'll almost certainly
encounter situations where a SID hasn't been accessed in a long time
but is still being legitimately cached somewhere. Just a file that
hasn't been accessed in a while might have that SID still cached in its
inode security blob, or anywhere else.
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.
Daniel Walsh
2017-12-14 18:11:46 UTC
Permalink
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
So, does docker just keep allocating a unique category set for every
new container, never reusing them even if the container is destroyed?
That would be a bug in docker IMHO. Or are you creating an unbounded
number of containers and never destroying the older ones?
You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.
Docker isn't using labeled networking (nor is anything else by default;
it is only enabled if explicitly configured).
If labeled networking weren't an issue we'd have full security
module stacking by now. Yes, it's an edge case. If you want to
use labeled NFS or a local filesystem that gets mounted in each
container (don't tell me that nobody would do that) you've got
the same problem.
Even if someone were to configure labeled networking, Docker is not
presently relying on that or SELinux network enforcement for any
security properties, so it really doesn't matter.
True enough. I can imagine a use case, but as you point out, it
would be a very complex configuration and coordination exercise
using SELinux.
Post by Stephen Smalley
And if they wanted
to do that, they'd have to coordinate category assignments across all
systems involved, for which no facility exists AFAIK. If you have two
docker instances running on different hosts, I'd wager that they can
hand out the same category sets today to different containers.
With respect to labeled NFS, that's also not the default for nfs
mounts, so again it is a custom configuration and Docker isn't relying
on it for any guarantees today. For local filesystems, they would
normally be context-mounted or using genfscon rather than xattrs in
order to be accessible to the container, thus no persistent storage of
the category sets.
Well, Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage that across nodes. But yes, we are not using labeled networking at this point.
Post by Casey Schaufler
I know that is the intended configuration, but I see people do
all sorts of stoopid things for what they believe are good reasons.
Unfortunately, lots of people count on containers to provide
isolation, but create "solutions" for data sharing that defeat it.
Post by Stephen Smalley
Certainly docker could provide an option to not reuse category sets,
but making that the default is not sane and just guarantees exhaustion
of the SID and context space (just create and tear down lots of
containers every day or more frequently).
It seems that Docker might have a similar issue with UIDs,
but it takes longer to run out of UIDs than sidtab entries.
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
You could delete a sidtab node. The code already deals with unfindable
SIDs. The issue is that eventually you run out of SIDs. Then you are
forced to recycle SIDs, which leads to the overthrow of the Elbonian
government.
We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.
I would suggest that if you delete a sidtab node and someone
comes along later and tries to use it that denial is exactly
what you would desire. I don't see any other rational action.
Yes, if we know that the SID wasn't in use at the time we tore it down.
But if we're just randomly deleting sidtab entries based on age or
something (since we have no reference count), we'll almost certainly
encounter situations where a SID hasn't been accessed in a long time
but is still being legitimately cached somewhere. Just a file that
hasn't been accessed in a while might have that SID still cached in its
inode security blob, or anywhere else.
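For context, a rough sketch of what reference-counted SIDs would have to look like before deletion could ever be safe; the types and helpers below are illustrative stand-ins and do not exist in the kernel:

#include <stdatomic.h>
#include <stdlib.h>

struct refcounted_sid_entry {
    unsigned int sid;
    atomic_uint refcount;     /* one count per cached copy of the SID */
    /* ... the associated security context would live here ... */
};

/* Every place that stores the SID (inode blobs, sockets, audit records,
 * other LSM callers) would have to pair these calls correctly. */
static void sid_get(struct refcounted_sid_entry *e)
{
    atomic_fetch_add(&e->refcount, 1);
}

static void sid_put(struct refcounted_sid_entry *e)
{
    if (atomic_fetch_sub(&e->refcount, 1) == 1) {
        /* last cached copy dropped; only now could the entry go away */
        free(e);
    }
}

The difficulty described above is exactly that SIDs today are plain u32 values copied around with no such get/put discipline.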
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the negative
case; there was an earlier optimization for the positive case by adding
a cache to sidtab_context_to_sid() prior to calling it. It's a reverse
lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least use
a hash table for these reverse lookups.
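To make the cost concrete, here is a minimal user-space model of the shape of that reverse lookup (a simplification, not the actual kernel sidtab code; all names are stand-ins): every conversion of a context that is not yet in the table walks all existing nodes, so inserting N distinct contexts costs O(N^2) overall.

#include <stdlib.h>
#include <string.h>

struct sidtab_node_model {
    unsigned int sid;                 /* allocated SID */
    char *context_str;                /* stand-in for struct context */
    struct sidtab_node_model *next;
};

struct sidtab_model {
    struct sidtab_node_model *head;
    unsigned int next_sid;
};

/* Reverse lookup: O(n) scan comparing every stored context. */
static struct sidtab_node_model *
search_context_model(struct sidtab_model *s, const char *ctx)
{
    struct sidtab_node_model *cur;

    for (cur = s->head; cur; cur = cur->next)
        if (strcmp(cur->context_str, ctx) == 0)
            return cur;
    return NULL;
}

/*
 * context -> SID: every *new* context pays the full scan above before a
 * node is inserted, so adding n distinct contexts costs O(n^2) overall.
 */
static int context_to_sid_model(struct sidtab_model *s, const char *ctx,
                                unsigned int *out_sid)
{
    struct sidtab_node_model *node = search_context_model(s, ctx);

    if (!node) {
        node = malloc(sizeof(*node));
        if (!node)
            return -1;
        node->sid = s->next_sid++;
        node->context_str = strdup(ctx);
        node->next = s->head;
        s->head = node;
    }
    *out_sid = node->sid;
    return 0;
}

With roughly 300,000 nodes, as in the report, each negative lookup is a 300,000-element scan with a full context comparison per node, which is consistent with the observed 100-200ms per call.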
yangjihong
2017-12-15 03:09:06 UTC
Permalink
Post by Casey Schaufler
Post by Stephen Smalley
Post by Casey Schaufler
Post by Stephen Smalley
sidtab_search_context() could no doubt be optimized for the
negative case; there was an earlier optimization for the positive
case by adding a cache to sidtab_context_to_sid() prior to
calling it. It's a reverse lookup in the sidtab.
This seems like a bad idea.
Not sure what you mean, but it can certainly be changed to at least
use a hash table for these reverse lookups.
Thanks for the reply and discussion.
I think the docker container scenario is only one example. Could a
similar path, perhaps some deliberate attack, keep growing the SID
list until it eventually leads to a system panic?

I think the core issue is that searching the SID list takes too long
once the list grows large.
If the node data structure (e.g. a tree) or the search algorithm were
optimized so that a lookup stays fast even with many nodes, that might
solve the problem.
Alternatively, sidtab.c could provide a "delete_sidtab_node" interface:
when a filesystem is unmounted its SID is no longer needed, so deleting
the node would keep the SID list bounded.

Thanks for reading and looking forward to your reply.

Best wishes!
Stephen Smalley
2017-12-15 13:56:46 UTC
Permalink
Post by yangjihong
Thanks for the reply and discussion.
I think the docker container scenario is only one example. Could a
similar path, perhaps some deliberate attack, keep growing the SID
list until it eventually leads to a system panic?
I think the core issue is that searching the SID list takes too long
once the list grows large.
If the node data structure (e.g. a tree) or the search algorithm were
optimized so that a lookup stays fast even with many nodes, that might
solve the problem.
Alternatively, sidtab.c could provide a "delete_sidtab_node" interface:
when a filesystem is unmounted its SID is no longer needed, so deleting
the node would keep the SID list bounded.
Thanks for reading and looking forward to your reply.
We cannot safely delete entries in the sidtab without first adding
reference counting of SIDs, which goes beyond just SELinux since they
are cached in other kernel data structures and returned by LSM hooks.
That's a non-trivial undertaking.

Far more practical in the near term would be to introduce a hash table
or other mechanism for efficient reverse lookups in the sidtab. Are
you offering to implement that or just requesting it?

Independent of that, docker should support reuse of category sets when
containers are deleted, at least as an option and probably as the
default.
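For illustration, one possible shape of that hash-table reverse lookup (a simplified user-space sketch with made-up names and sizes, not an actual patch): hashing the context into buckets means both the hit and the miss path probe one short chain instead of scanning every node in the table.

#include <stdlib.h>
#include <string.h>

#define REV_BUCKETS 1024

struct rev_node {
    unsigned int sid;
    char *context_str;
    struct rev_node *next;
};

struct rev_sidtab {
    struct rev_node *buckets[REV_BUCKETS];
    unsigned int next_sid;
};

static unsigned int context_hash(const char *ctx)
{
    unsigned int h = 5381;            /* djb2, purely for illustration */

    while (*ctx)
        h = h * 33 + (unsigned char)*ctx++;
    return h % REV_BUCKETS;
}

static int rev_context_to_sid(struct rev_sidtab *s, const char *ctx,
                              unsigned int *out_sid)
{
    unsigned int b = context_hash(ctx);
    struct rev_node *cur;

    for (cur = s->buckets[b]; cur; cur = cur->next) {
        if (strcmp(cur->context_str, ctx) == 0) {
            *out_sid = cur->sid;      /* hit: O(1) on average */
            return 0;
        }
    }

    /* miss: allocate a new SID and index it under the same bucket */
    cur = malloc(sizeof(*cur));
    if (!cur)
        return -1;
    cur->sid = s->next_sid++;
    cur->context_str = strdup(ctx);
    cur->next = s->buckets[b];
    s->buckets[b] = cur;
    *out_sid = cur->sid;
    return 0;
}

The forward SID-to-context direction is untouched in this sketch; only the context-to-SID direction gains the extra index.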
Daniel Walsh
2017-12-15 14:50:44 UTC
Permalink
Post by Stephen Smalley
Independent of that, docker should support reuse of category sets when
containers are deleted, at least as an option and probably as the
default.
Docker does reuse categories of containers that are removed, by default.
yangjihong
2017-12-16 10:28:45 UTC
Permalink
Post by Stephen Smalley
We cannot safely delete entries in the sidtab without first adding
reference counting of SIDs, which goes beyond just SELinux since they
are cached in other kernel data structures and returned by LSM hooks.
That's a non-trivial undertaking.
Far more practical in the near term would be to introduce a hash table
or other mechanism for efficient reverse lookups in the sidtab. Are
you offering to implement that or just requesting it?
I'm not very familiar with the overall architecture of SELinux, so I'm afraid I can't offer to implement it myself, sorry.
Please tell me if there is anything I can do to help.
If there is any progress (i.e. a decision on the solution or optimization method), could you please let me know? Thanks!
Post by Daniel Walsh
Post by Stephen Smalley
Independent of that, docker should support reuse of category sets when
containers are deleted, at least as an option and probably as the
default.
Docker does reuse categories of containers that are removed, by default.
Thanks for reading and looking forward to your reply.
Best wishes!