Discussion:
[PATCH 0/7] Initial support for user namespace owned mounts
Casey Schaufler
2015-07-15 20:36:24 UTC
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 3-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
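For readers following along, the rule described in item 2 can be pictured with a small toy model. This is a Python sketch, not kernel code: `SuperBlock`, `FileAttrs`, `INIT_USER_NS` and `effective_exec_attrs` are hypothetical names, and the model assumes the simplest reading of the cover letter, namely that a superblock whose `s_user_ns` is not the initial namespace has its suid bit, file capabilities and security label ignored.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the kernel's &init_user_ns; the string value is arbitrary.
INIT_USER_NS = "init_user_ns"


@dataclass(frozen=True)
class SuperBlock:
    fs_type: str
    s_user_ns: str  # user namespace that owns the mount (hypothetical model)


@dataclass(frozen=True)
class FileAttrs:
    setuid: bool
    file_caps: frozenset
    label: Optional[str]


def effective_exec_attrs(sb: SuperBlock, attrs: FileAttrs) -> FileAttrs:
    """Return the attributes honoured for a file on superblock `sb`.

    Assumption of this sketch: attributes on a superblock owned by the
    initial namespace are trusted as-is; on any other superblock the
    suid bit, file caps and security label are all dropped.
    """
    if sb.s_user_ns == INIT_USER_NS:
        return attrs
    return FileAttrs(setuid=False, file_caps=frozenset(), label=None)
```

In this model a setuid binary carrying `CAP_SYS_ADMIN` on an ext4 image mounted from a user namespace comes back with no setuid bit, no caps and no label, while the same file on a mount owned by the initial namespace is returned unchanged.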
Lukasz Pawelczyk <***@samsung.com> proposed
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
Subsequent patches will update the vfs for id translation, handling
various corner cases, giving privileges to the user namespace which owns
a superblock, and finally supporting user namespace mounts for ext4 and
fuse.
Thanks,
Seth
fs: Treat foreign mounts as nosuid
userns: Simpilify MNT_NODEV handling.
fs: Add user namesapace member to struct super_block
fs: Ignore file caps in mounts from other user namespaces
security: Restrict security attribute updates for userns mounts
selinux: Ignore security labels on user namespace mounts
smack: Don't use security labels for user namespace mounts
fs/block_dev.c | 2 +-
fs/exec.c | 2 +-
fs/namei.c | 9 ++++++++-
fs/namespace.c | 34 ++++++++++++++++++++--------------
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 9 +++++++++
include/linux/mount.h | 1 +
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 4 +++-
security/security.c | 10 +++++++++-
security/selinux/hooks.c | 16 +++++++++++++++-
security/smack/smack_lsm.c | 12 ++++++++++--
14 files changed, 134 insertions(+), 28 deletions(-)
Eric W. Biederman
2015-07-15 21:06:35 UTC
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".

This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.

The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is what
Seth has implemented.

In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.

Getting s_user_ns present on struct super, properly set, and all of the
appropriate checks against it present in the vfs so that filesystems
don't need to duplicate logic is important if we are going to do more
interesting things with user namespaces (as users have been asking for).

It is important for things as small as making it safe to allow
truly unprivileged users to mount fuse filesystems.

I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked,
some parts I had issues with. As I recall, one of my issues was that
those patches conflicted in detail, if not in principle, with this
approach.

If these patches do not do a good job of laying the groundwork for
supporting security labels that unprivileged users can set, then Seth
could really use some feedback. Figuring out how to properly deal with
the LSMs has been one of his challenges.

I am hoping I can finish working through the patches to fix the
semantics of rename and bind mounts before the next merge window opens,
so I can have enough cycles to lift the feature freeze on user
namespaces. Except for maybe his first two patches (which fix a small
userspace API breakage) none of Seth's patches get to go in until I lift
the freeze.

Which is probably too much information but I hope this makes it clear
that the point of this work is as an enabler for future developments,
not as something to make user namespaces and LSMs incompatible.

Eric
Eric W. Biederman
2015-07-15 22:28:03 UTC
Post by Eric W. Biederman
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".
This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.
Ugh.

This made me realize that we have an interesting problem here. An
unprivileged mount of tmpfs probably needs to have
s_user_ns == &init_user_ns.

Otherwise we will break security labels on tmpfs for no good reason.
ramfs and sysfs also seem to have similar concerns.

Because they have no backing store we can trust those filesystems with
security labels. Plus, for sysfs at least, there is the security label
bleed-through issue, which we need to make certain keeps working.

Perhaps these filesystems with trusted backing store need to call
"sget_userns(..., &init_user_ns)".

If we don't get this right we will have significant regressions with
respect to security labels, and that is not ok.
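A rough way to picture the suggestion above, written as a toy Python model rather than kernel code. The list of filesystems without a user-controlled backing store is taken from the names mentioned in this message (tmpfs, ramfs, sysfs); `superblock_owner`, `labels_trusted` and the namespace strings are made-up names for illustration only.

```python
# Stand-in for the kernel's &init_user_ns; the string value is arbitrary.
INIT_USER_NS = "init_user_ns"

# Filesystems named in the message as having no backing store that an
# unprivileged user could tamper with.
NO_USER_BACKING_STORE = frozenset({"tmpfs", "ramfs", "sysfs"})


def superblock_owner(fs_type: str, mounter_user_ns: str) -> str:
    """Pick the namespace recorded as s_user_ns in this toy model.

    Filesystems with a trusted (kernel-only) backing store are pinned to
    the initial namespace, mirroring the idea of calling
    sget_userns(..., &init_user_ns); everything else is owned by the
    namespace of whoever mounted it.
    """
    if fs_type in NO_USER_BACKING_STORE:
        return INIT_USER_NS
    return mounter_user_ns


def labels_trusted(fs_type: str, mounter_user_ns: str) -> bool:
    """Security labels are honoured only on init-owned superblocks."""
    return superblock_owner(fs_type, mounter_user_ns) == INIT_USER_NS
```

Under these assumptions a tmpfs mounted from inside a container keeps working security labels, while an ext4 image mounted from the same namespace does not.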

Eric
Eric W. Biederman
2015-07-16 02:20:23 UTC
Post by Eric W. Biederman
Post by Eric W. Biederman
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".
This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.
Ugh.
This made me realize that we have an interesting problem here. An
unprivileged mount of tmpfs probably needs to have
s_user_ns == &init_user_ns.
Otherwise we will break security labels on tmpfs for no good reason.
ramfs and sysfs also seem to have similar concerns.
Because they have no backing store we can trust those filesystems with
security labels. Plus for at least sysfs there is the security label
bleed through issue, that we need to make certain works.
Perhaps these filesystems with trusted backing store need to call
"sget_userns(..., &init_user_ns)".
If we don't get this right we will have significant regressions with
respect to security labels, and that is not ok.
That's only a problem if there's anyone who sets security labels on
such a mount. You need global caps to do that (I hope), which
requires someone outside the userns to help, which means there's a
good chance that literally no one does this.
Fair enough. That is however something we need to test. If no one
puts security labels or file caps on such a mount we can change things.
Otherwise we can't, because it would introduce regressions.

Eric
Stephen Smalley
2015-07-16 13:12:48 UTC
Post by Eric W. Biederman
Post by Eric W. Biederman
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".
This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.
Ugh.
This made me realize that we have an interesting problem here. An
unprivileged mount of tmpfs probably needs to have
s_user_ns == &init_user_ns.
Otherwise we will break security labels on tmpfs for no good reason.
ramfs and sysfs also seem to have similar concerns.
Because they have no backing store we can trust those filesystems with
security labels. Plus for at least sysfs there is the security label
bleed through issue, that we need to make certain works.
Perhaps these filesystems with trusted backing store need to call
"sget_userns(..., &init_user_ns)".
If we don't get this right we will have significant regressions with
respect to security labels, and that is not ok.
That's only a problem if there's anyone who sets security labels on
such a mount. You need global caps to do that (I hope), which
requires someone outside the userns to help, which means there's a
good chance that literally no one does this.
Setting of security.selinux attributes is governed by SELinux permission
checks, not by capabilities.

Also, files are always assigned a label at creation time; a tmpfs inode
will be labeled based on its creator without any userspace entity ever
calling setxattr() at all.
Casey Schaufler
2015-07-15 22:39:03 UTC
Post by Eric W. Biederman
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".
It gives an unprivileged user the ability to ignore
the Smack labels that are on files and to create files
with labels that do not match the rules laid down by the
security module.
Post by Eric W. Biederman
This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.
OK, you don't trust the metadata on a filesystem mounted by an untrusted
user. That's fair.
Post by Eric W. Biederman
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
A filesystem with Smack labels gets mounted in a namespace. The labels
are ignored. Instead, the filesystem defaults (potentially specified as
mount options smackfsdef="something", but usually the floor label ("_"))
are used, giving the user the ability to read everything and (usually)
change nothing. This is both dangerous (unintended read access to files)
and pointless (can't make changes).

I can't speak authoritatively for SELinux, but it looks to me like you
may have similar issues there.
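To make the concern concrete, here is a deliberately simplified Python model of the Smack access check. It follows the built-in rules listed in the kernel's Smack documentation (star, hat and floor labels, same-label access, then the loaded rule set) and leaves out everything else, so treat it as a sketch. `smack_access` and `effective_label` are hypothetical helper names, and the labels "User" and "Secret" are invented for the example.

```python
READ_LIKE = frozenset("rx")  # read and execute requests


def smack_access(subject, obj, request, rules=frozenset()):
    """Simplified Smack decision.

    `request` is a string of access letters such as "r" or "rw".
    `rules` is a set of (subject, object, modes) tuples standing in for
    the loaded rule set.
    """
    wanted = set(request)
    if subject == "*":                 # tasks labeled star get nothing
        return False
    if wanted <= READ_LIKE and (subject == "^" or obj == "_"):
        return True                    # hat reads anything; floor is readable
    if obj == "*" or subject == obj:   # star objects, or matching labels
        return True
    granted = {m for (s, o, modes) in rules if (s, o) == (subject, obj)
               for m in modes}
    return wanted <= granted           # explicit rules, else denied


def effective_label(on_disk_label, labels_ignored, fs_default="_"):
    """Label seen by the check when on-disk labels are ignored."""
    return fs_default if labels_ignored else on_disk_label


# A task labeled "User" and a file labeled "Secret", with no rule loaded.
honoured = effective_label("Secret", labels_ignored=False)
ignored = effective_label("Secret", labels_ignored=True)
print(smack_access("User", honoured, "r"))  # labels honoured
print(smack_access("User", ignored, "r"))   # labels ignored, read request
print(smack_access("User", ignored, "w"))   # labels ignored, write request
```

With the on-disk label honoured the read is denied. Once the label is replaced by the floor label the same read is allowed and the write is still denied, which is the "read everything, change nothing" outcome described above.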
Post by Eric W. Biederman
In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
The problem is not that the short term behavior is uninteresting,
it's that it is broken. Mounting a filesystem with xattrs and ignoring
those xattrs results in incorrect access control decisions.
Post by Eric W. Biederman
Getting s_user_ns present on struct super, properly set, and all of the
appropriate checks against it present in the vfs so that filesystems
don't need to duplicate logic is important if we are going do more
interesting things with user namespaces (as users have been asking for).
OK, but the fact that someone wants to do something they shouldn't
doesn't mean you get to break things that work now to accommodate
them. There are reasons why mounting filesystems requires privilege!
Post by Eric W. Biederman
It is important for things as small as making it safe to allow
truly unprivileged users to mount fuse filesystems.
If it isn't safe you shouldn't be doing it, even if it's "small"
and something that would make life easier for some set of users.
Post by Eric W. Biederman
I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked
some parts I had issues with. As I recall one of my issues was that
those patches conflicted in detail if not in principle with this
appropach.
If these patches do not do a good job of laying the ground work for
supporting security labels that unprivileged users can set than Seth
could really use some feedback. Figuring out how to properly deal with
the LSMs has been one of his challenges.
The feedback is that you can't pick and
choose when you are going to pay attention to the security attributes
on a filesystem. It's possible that it will work out the way you want
it, but it probably won't. Smack doesn't allow you to choose if you're
using xattrs. SELinux does, but certainly doesn't expect you to be
flipping it on and off. I'm not convinced that it's safe to do for
capability sets, either, but I'm not up to arguing PIxFE+ vector
calculations just now.
Post by Eric W. Biederman
I am hoping I can finishing working through the patches to fix the
semantics of rename and bind mounts before the next merge window opens,
so I can have enough cycles to lift the feature freeze on user
namespaces. Except for maybe his first two patches (which fix a small
userspace API breakage) none of Seth's patches get to go in until I lift
the freeze.
Thanks. I know (believe me, I know) how frustrating it can be when
you get the big NAK on something that seems like it's addressed.
Unfortunately, the proposed approach (not just the specifics of
implementation) does not work.
Post by Eric W. Biederman
Which is probably too much information but I hope this makes it clear
that the point of this work is as an enabler for future developments,
not as something to make user namespaces and LSMs incompatible.
I am paranoid, but not to the extent that I think anyone
is trying to break the interaction between security modules
and namespaces. Having worked with Lukasz on his security
namespace patches it is clear to me that this is not a simple
problem and that it is unlikely to have the simple solution
everyone would like to see. I also don't see an intermediate
state that works while the "real" solution is being refined.
As always, I'm willing to be proven wrong.
Post by Eric W. Biederman
Eric
Casey Schaufler
2015-07-16 02:54:27 UTC
Post by Casey Schaufler
Post by Eric W. Biederman
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
A filesystem with Smack labels gets mounted in a namespace. The labels
are ignored. Instead, the filesystem defaults (potentially specified as
mount options smackfsdef="something", but usually the floor label ("_"))
are used, giving the user the ability to read everything and (usually)
change nothing. This is both dangerous (unintended read access to files)
and pointless (can't make changes).
I don't get it.
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them are fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.

If you do mount a filesystem it needs to behave according to the
rules of the system. If you have a security module that uses
attributes on the filesystem you can't ignore them just because
it's "your data". Mandatory access control schemes, including
Smack and SELinux, don't give a fig about who you are. It's the
label on the data and the process that matter. If "you" get to
muck the labels up, you've broken the mandatory access control.
--Andy
Eric W. Biederman
2015-07-16 04:47:08 UTC
Post by Casey Schaufler
Post by Casey Schaufler
Post by Eric W. Biederman
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is Seth
has implemented.
A filesystem with Smack labels gets mounted in a namespace. The labels
are ignored. Instead, the filesystem defaults (potentially specified as
mount options smackfsdef="something", but usually the floor label ("_"))
are used, giving the user the ability to read everything and (usually)
change nothing. This is both dangerous (unintended read access to files)
and pointless (can't make changes).
I don't get it.
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them are fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues with mounting filesystems without
privilege, by which I actually mean mounting filesystems as the root
user in a user namespace.

- Are the semantics safe?
- Is the extra attack surface a problem?

Figuring out how to make semantics safe is what we are talking about.

Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.

With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.

I haven't looked closely but I think docker is just about as bad as
those desktop environments when it comes to mounting filesystems.
Post by Casey Schaufler
If you do mount a filesystem it needs to behave according to the
rules of the system.
I agree.
Post by Casey Schaufler
If you have a security module that uses
attributes on the filesystem you can't ignore them just because
it's "your data". Mandatory access control schemes, including
Smack and SELinux don't give a fig about who you are. It's the
label on the data and the process that matter. If "you" get to
muck the labels up, you've broken the mandatory access control.
So there are filesystems like fat and minix that can not store a label.
Since it is not possible to store labels securely in filesystems mounted
by unprivileged users (at least in the normal sense) the intent would be
to treat a filesystem mounted without the privileges of the global root
user as a filesystem that does not support xattrs.

Treating such a filesystem as a filesystem that does not support xattrs
is the only possible way to support such a filesystem securely, because
as you have said someone who can muck up the labels breaks mandatory
access control.

Given how non-trivial it is to grasp the nuances of different LSMs'
mandatory access control semantics, I am asking Seth for the first pass
to simply forbid mounting of filesystems with just user namespace
permissions when there is an LSM active.
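As a sketch of what that first pass amounts to, here is the decision written out as a toy Python function. None of this is kernel code; `may_mount` and its three boolean parameters are hypothetical names, and the model assumes the only inputs are whether the mounter has global privilege, whether it has privilege only in its own user namespace, and whether any LSM is active.

```python
def may_mount(has_global_privilege: bool,
              has_userns_privilege: bool,
              lsm_active: bool) -> bool:
    """First-pass policy: userns-only privilege is not enough under an LSM."""
    if has_global_privilege:
        return True          # the traditional privileged mount path
    if not has_userns_privilege:
        return False         # no privilege anywhere, so no mount
    return not lsm_active    # userns root may mount only with no LSM active
```

The point of the conservative rule is that no LSM has to reason about untrusted labels until its maintainers decide how they want to handle them.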

Once we get that far smack may never need to support such systems.

Eric
Dave Chinner
2015-07-17 00:09:14 UTC
Post by Eric W. Biederman
Post by Casey Schaufler
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them are fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues mounting filesystems without privielge,
by which I actually mean mounting filesystems as the root user in a user
namespace.
- Are the semantics safe.
- Is the extra attack surface a problem.
I think the attack surface this exposes is the biggest problem
facing this proposal.
Post by Eric W. Biederman
Figuring out how to make semantics safe is what we are talking about.
Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.

User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Eric W. Biederman
2015-07-17 00:42:03 UTC
Post by Dave Chinner
Post by Eric W. Biederman
Post by Casey Schaufler
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them are fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues mounting filesystems without privilege,
by which I actually mean mounting filesystems as the root user in a user
namespace.
- Are the semantics safe.
- Is the extra attack surface a problem.
I think the attack surface this exposes is the biggest problem
facing this proposal.
I completely agree.
Post by Dave Chinner
Post by Eric W. Biederman
Figuring out how to make semantics safe is what we are talking about.
Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.

At this point any setting of the FS_USER_MOUNT flag I figure needs to go
through the filesystem maintainers tree and they need to be aware of and
agree to deal with the attack from below issue.

The one filesystem I truly expect we can make work is fuse. fuse has
been designed to deal with some variation of the attack from below issue
since day one. We looked at what the patches to fuse would look like
with the current state of the vfs and it was not pretty.

We very much need to sort through as much as possible at the vfs layer,
and in generic code. Allow everyone to see what is going on and how
it works before proceeding with enabling any filesystems.

I truly hope we can find a small set of block device filesystems that we
can harden from attack below. That would allow linux to have serious
defenses against evil usb stick attacks. I think that is going to take
a lot of careful coding, testing and validation and advancing the state
of the art to get there.

Eric
Dave Chinner
2015-07-17 02:47:35 UTC
Post by Eric W. Biederman
Post by Dave Chinner
Post by Eric W. Biederman
Post by Casey Schaufler
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them is fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues mounting filesystems without privilege,
by which I actually mean mounting filesystems as the root user in a user
namespace.
- Are the semantics safe.
- Is the extra attack surface a problem.
I think the attack surface this exposes is the biggest problem
facing this proposal.
I completely agree.
Post by Dave Chinner
Post by Eric W. Biederman
Figuring out how to make semantics safe is what we are talking about.
Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.
No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.
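
To make the CRC point concrete, here is a minimal sketch in C. The block
layout and every name in it (meta_block, block_seal, attacker_rewrite) are
invented for illustration and do not correspond to any real filesystem's
on-disk format. It only shows that a checksum computed with a public
algorithm can be recomputed by whoever edits the block, so a passing CRC
says nothing about who wrote the contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk metadata block: a payload followed by its CRC-32. */
struct meta_block {
    uint8_t  payload[32];
    uint32_t crc;
};

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320); no secret involved. */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Seal a block the way a filesystem would at write time. */
static void block_seal(struct meta_block *b)
{
    b->crc = crc32_calc(b->payload, sizeof(b->payload));
}

/* All a read-side check can do with a CRC: recompute and compare. */
static int block_crc_ok(const struct meta_block *b)
{
    return crc32_calc(b->payload, sizeof(b->payload)) == b->crc;
}

/* What anyone with write access to the image can do: edit, then reseal. */
static void attacker_rewrite(struct meta_block *b, size_t off, uint8_t val)
{
    assert(off < sizeof(b->payload));
    b->payload[off] = val;
    block_seal(b);  /* same public algorithm, so the check passes again */
}
```

A keyed signature differs only in that the reseal step would need a key the
attacker does not hold, which is the property described in the next
paragraph.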

The only way a filesystem would be able to trust what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature. No filesystem we have has that capability and AFAIA there
are no plans for any filesystem to implement such tamper detection.
And no, ext4 encryption does not provide this because it only stores
the values and data in encrypted format and does not protect
metadata from tampering when it is not mounted.

If we don't have crypto signatures in metadata, then XFS is probably
the most robust against tampering as it does a lot more checking of
the on-disk metadata before it is used than any other filesystem
(i.e. see the verifier infrastructure that does corruption checks
after read (in io completion) and before write (in io submission)
to catch bad metadata before it is used by the kernel, or before it
is written to disk by the kernel).

However, these checks are far from comprehensive. We can only check
internal consistency of the metadata objects in the block, and even
then we really only can check for values within range rather than
absolute correctness. e.g. we can check a dirent has a valid name,
length, ftype and inode number, but we can't validate that the inode
is actually allocated or not because that requires a lookup in the
allocated inode btree. We *trust* that inode number to be
allocated and valid because it is in metadata the filesystem wrote.
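
The gap between "values within range" and "absolute correctness" can be
sketched as follows. This is not XFS code: the struct, the limits and the
function names are assumptions made up for the example, and a sorted array
stands in for the allocated inode btree. The first function is the kind of
check that can be done from the block alone; the second needs a lookup in a
separate structure, which is the expensive part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified directory entry; NOT the real XFS on-disk layout. */
struct demo_dirent {
    uint64_t inumber;
    uint8_t  namelen;
    uint8_t  ftype;
    char     name[255];
};

#define DEMO_FTYPE_MAX 8u            /* assumed number of valid file types */
#define DEMO_INO_LIMIT (1ull << 32)  /* assumed upper bound on inode numbers */

/* Block-local verifier: each field is in range. Needs no other metadata. */
static bool demo_dirent_in_range(const struct demo_dirent *d)
{
    if (d->namelen == 0 || d->ftype >= DEMO_FTYPE_MAX)
        return false;
    if (d->inumber == 0 || d->inumber >= DEMO_INO_LIMIT)
        return false;
    for (size_t i = 0; i < d->namelen; i++)
        if (d->name[i] == '\0' || d->name[i] == '/')
            return false;
    return true;
}

/*
 * Absolute check: is the inode actually allocated? This needs a lookup in a
 * separate structure (a sorted array stands in for the allocated inode btree).
 */
static bool demo_inode_allocated(const uint64_t *sorted, size_t n, uint64_t ino)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (sorted[mid] == ino)
            return true;
        if (sorted[mid] < ino)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}
```

A dirent that names an in-range but unallocated inode passes the first
check and fails only the second.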

For inode numbers that come from untrusted sources (NFS,
open-by-handle, etc) we have a flag that does inode number
validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
metadata (i.e. the allocated inode btrees), but that is expensive
and so not done on inodes that we pull directly from metadata that
has come from disk. Indeed, we still trust on-disk metadata to be
correct to validate that other metadata can be trusted, so if one
structure can be tampered with, so can others.

IOWs, if we cannot trust one part of the filesystem metadata to be
correct, then we cannot trust that filesystem *at all*, *for
anything*. And even running fsck doesn't restore trust - all it does
is tell us that any modification that was made is not a detectable
inconsistency that needs fixing.
Post by Eric W. Biederman
At this point any setting of the FS_USER_MOUNT flag I figure needs to go
through the filesystem maintainers tree and they need to be aware of and
agree to deal with the attack from below issue.
The one filesystem I truly expect we can make work is fuse. fuse has
been designed to deal with some variation of the attack from below issue
since day one. We looked at what the patches to fuse would look like
with the current state of the vfs and it was not pretty.
We very much need to sort through as much as possible at the vfs layer,
and in generic code. Allow everyone to see what is going on and how
it works before proceeding with enabling any filesystems.
The VFS protects us from attacks from above the filesystem, not
below. The VFS plays no part in validating the on-disk structure of
a filesystem which is what attacks from below will be attempting to
exploit.
Post by Eric W. Biederman
I truly hope we can find a small set of block device filesystems that we
can harden from attack below. That would allow linux to have serious
defenses against evil usb stick attacks. I think that is going to take
a lot of careful coding, testing and validation and advancing the state
of the art to get there.
Somehow, I just can't see that happening.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
J. Bruce Fields
2015-07-21 17:37:21 UTC
Post by Dave Chinner
Post by Eric W. Biederman
Post by Dave Chinner
Post by Eric W. Biederman
Post by Casey Schaufler
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them is fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues mounting filesystems without privilege,
by which I actually mean mounting filesystems as the root user in a user
namespace.
- Are the semantics safe.
- Is the extra attack surface a problem.
I think the attack surface this exposes is the biggest problem
facing this proposal.
I completely agree.
Post by Dave Chinner
Post by Eric W. Biederman
Figuring out how to make semantics safe is what we are talking about.
Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.
No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.
The only way a filesystem would be able to trust what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature.
Preventing tampering is a little different from protecting the kernel
from attack, isn't it? I thought the latter was what people were asking
about.

So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.

--b.
Post by Dave Chinner
No filesystem we have has that capability and AFAIA there
are no plans for any filesystem to implement such tamper detection.
And no, ext4 encryption does not provide this because it only stores
the values and data in encrypted format and does not protect
metadata from tampering when it is not mounted.
If we don't have crypto signatures in metadata, then XFS is probably
the most robust against tampering as it does a lot more checking of
the on-disk metadata before it is used than any other filesystem
(i.e. see the verifier infrastructure that does corruption checks
after read (in io completion) and before write (in io submission)
to catch bad metadata before it is used by the kernel, or before it
is written to disk by the kernel).
However, these checks are far from comprehensive. We can only check
internal consistency of the metadata objects in the block, and even
then we really only can check for values within range rather than
absolute correctness. e.g. we can check a dirent has a valid name,
length, ftype and inode number, but we can't validate that the inode
is actually allocated or not because that requires a lookup in the
allocated inode btree. We *trust* that inode number to be
allocated and valid because it is in metadata the filesystem wrote.
For inode numbers that come from untrusted sources (NFS,
open-by-handle, etc) we have a flag that does inode number
validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
metadata (i.e. the allocated inode btrees), but that is expensive
and so not done on inodes that we pull directly from metadata that
has come from disk. Indeed, we still trust on-disk metadata to be
correct to validate that other metadata can be trusted, so if one
structure can be tampered with, so can others.
IOWs, if we cannot trust one part of the filesystem metadata to be
correct, then we cannot trust that filesystem *at all*, *for
anything*. And even running fsck doesn't restore trust - all it does
is tell us that any modification that was made is not a detectable
inconsistency that needs fixing.
Post by Eric W. Biederman
At this point any setting of the FS_USER_MOUNT flag I figure needs to go
through the filesystem maintainers tree and they need to be aware of and
agree to deal with the attack from below issue.
The one filesystem I truly expect we can make work is fuse. fuse has
been designed to deal with some variation of the attack from below issue
since day one. We looked at what the patches to fuse would look like
with the current state of the vfs and it was not pretty.
We very much need to sort through as much as possible at the vfs layer,
and in generic code. Allow everyone to see what is going on and how
it works before proceeding with enabling any filesystems.
The VFS protects us from attacks from above the filesystem, not
below. The VFS plays no part in validating the on-disk structure of
a filesystem which is what attacks from below will be attempting to
exploit.
Post by Eric W. Biederman
I truly hope we can find a small set of block device filesystems that we
can harden from attack below. That would allow linux to have serious
defenses against evil usb stick attacks. I think that is going to take
a lot of careful coding, testing and validation and advancing the state
of the art to get there.
Somehow, I just can't see that happening.
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2015-07-22 07:56:40 UTC
Post by J. Bruce Fields
Post by Dave Chinner
Post by Eric W. Biederman
Post by Dave Chinner
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.
No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.
The only way a filesystem would be able to trust what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature.
Preventing tampering is a little different from protecting the kernel
from attack, isn't it? I thought the latter was what people were asking
about.
People might be asking for the latter, but the only attack vector
that can be made against filesystems from below is via tampering
with the on-disk structure.

An untrusted user in an untrusted container can construct arbitrary
untrusted filesystem structures and get them parsed by a context
running as $DEITY that assumes the structure is from a trusted
source. What can possibly go wrong?

IOWs, to protect the kernel against attack from untrusted filesystem
images, we either have to be able to guarantee the image can not be
modified by untrusted parties (i.e. needs to be created with
signed tools, contain only signed filesystem metadata and
signed/encrypted data), or we have to sandbox the filesystem parsing
code completely (i.e. fuse).
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation? e.g. cyclic links may
only manifest when completely unrelated pieces of metadata are linked
together in a specific way.

Further, the problem is not restricted to validation at mount time -
if the user can write to the filesystem image file, then they can
modify it after it has been mounted, too. That means the attacker
may be someone who has broken into a container, not necessarily the
user you trusted with unprivileged mounts. That means every cold
metadata read needs to be treated with suspicion, not just at mount
time.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
J. Bruce Fields
2015-07-22 14:09:23 UTC
Post by Dave Chinner
Post by J. Bruce Fields
Post by Dave Chinner
Post by Eric W. Biederman
Post by Dave Chinner
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.
No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.
The only way a filesystem would be able to trust what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature.
Preventing tampering is a little different from protecting the kernel
from attack, isn't it? I thought the latter was what people were asking
about.
People might be asking for the latter, but the only attack vector
that can be made against filesystems from below is via tampering
with the on-disk structure.
An untrusted user in an untrusted container can construct arbitrary
untrusted filesystem structures and get them parsed by a context
running as $DEITY that assumes the structure is from a trusted
source. What can possibly go wrong?
IOWs, to protect the kernel against attack from untrusted filesystem
images, we either have to be able to guarantee the image can not be
modified by untrusted parties (i.e. needs to be created with
signed tools, contain only signed filesystem metadata and
signed/encrypted data),
I don't think that works--who exactly would be the "trusted party"? It
can't be this kernel or this hardware--users expect to be able to mount
filesystems created by older kernels, on other machines, running other
distributions (even other operating systems). It can't be the
user--then any user could compromise the kernel by signing a bad
filesystem.

Authenticating the creator of the filesystem might be useful for other
reasons, but it sounds to me like at best only very weak protection
against corrupted filesystems.

As a similar example, browser makers are stuck both implementing SSL and
hardening their code against malicious content. Those address separate
problems.
Post by Dave Chinner
or we have to sandbox the filesystem parsing
code completely (i.e. fuse).
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.

For the dcache, it's d_splice_alias that does that (using d_ancestor).

(And I believe the main motivation for that was NFS, where you don't
need a filesystem cycle, just a server-side race that can briefly make
it look like there's one--an example of the changing filesystem problem
that you point out below.)
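
As a rough illustration of "prevent adding an edge which would be part of a
cycle", here is a small C sketch over an in-memory parent-pointer tree. It
is not the kernel's d_splice_alias or d_ancestor code; demo_node and both
functions are hypothetical names, and the only assumption is that the tree
already held in memory is acyclic, so the upward walk terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory tree node; a stand-in for a dentry, not the real struct. */
struct demo_node {
    struct demo_node *parent;   /* NULL at the root */
};

/* True if 'a' is 'b' itself or one of b's ancestors. */
static bool demo_is_self_or_ancestor(const struct demo_node *a,
                                     const struct demo_node *b)
{
    for (; b != NULL; b = b->parent)
        if (b == a)
            return true;
    return false;
}

/*
 * Make 'child' a child of 'new_parent', refusing the edge if it would close
 * a cycle. Returns 0 on success, -1 if the edge was rejected.
 */
static int demo_attach(struct demo_node *child, struct demo_node *new_parent)
{
    if (demo_is_self_or_ancestor(child, new_parent))
        return -1;
    child->parent = new_parent;
    return 0;
}
```

The check runs at the moment untrusted input asks for a new edge, so the
in-memory structure stays a tree whatever the on-disk metadata claims.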
Post by Dave Chinner
e.g. cyclic links may
only manifest when completely unrelated pieces of metadata are linked
together in a specific way.
Further, the problem is not restricted to validation at mount time -
if the user can write to the filesystem image file, then they can
modify it after it has been mounted, too. That means the attacker
may be someone who has broken into a container, not necessarily the
user you trusted with unprivileged mounts. That means every cold
metadata read needs to be treated with suspicion, not just at mount
time.
Yes. Agreed that this is difficult. (I can't actually give an example
of an existing problem of this sort, but I'd be surprised if they don't
exist.)

--b.
Austin S Hemmelgarn
2015-07-22 16:52:58 UTC
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.
Except if the user can write to the filesystem's backing storage (be it
a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the metadata.
So unless the kernel builds the graph internally by parsing the metadata
_and_ has some way to detect that the on-disk metadata has hit a cycle
(which may not just involve 2 items), then you still have the potential
for a DoS attack.

Trust me, I've done this before (quite a while back when I was just
starting out with programming on Linux) with hard-link cycles in an ext4
filesystem in a virtual machine just to see what would happen (IIRC,
something deadlocked, I can't remember though if it was fsck or trying
to access the file once the FS was mounted) (and in fact, I think I may
try this again just to see if anything has changed).
J. Bruce Fields
2015-07-22 17:41:00 UTC
Post by Austin S Hemmelgarn
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.
Except if the user can write to the filesystem's backing storage (be
it a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the
metadata. So unless the kernel builds the graph internally by
parsing the metadata _and_ has some way to detect that the on-disk
metadata has hit a cycle (which may not just involve 2 items),
Understood. Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.
Post by Austin S Hemmelgarn
then
you still have the potential for a DoS attack.
Trust me, I've done this before (quite a while back when I was just
starting out with programming on Linux) with hard-link cycles in an
ext4 filesystem in a virtual machine just to see what would happen
(IIRC, something deadlocked, I can't remember though if it was fsck
or trying to access the file once the FS was mounted) (and in fact,
I think I may try this again just to see if anything has changed).
I've also seen bugs caused by loops in corrupted ext4 filesystems. As
far as I know, they're fixed as of 95ad5c291313b.

(I mentioned the example of dcache loops because it's something I
happened to run across before. I'm sure there are any number of cases
where we need similar checking to keep internal data structures
consistent in the face of unexpected filesystem content.)

--b.
Dave Chinner
2015-07-23 01:51:35 UTC
Post by J. Bruce Fields
Post by Austin S Hemmelgarn
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.
Except if the user can write to the filesystem's backing storage (be
it a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the
metadata. So unless the kernel builds the graph internally by
parsing the metadata _and_ has some way to detect that the on-disk
metadata has hit a cycle (which may not just involve 2 items),
Understood. Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.
But that only addresses one type of loop in one specific metadata
structure. There's plenty of other ways you could construct metadata
loops that are essentially undetected and result in either deadlock
or livelock within the filesystem code itself. e.g. just make btree
sibling pointers loop over a range of entries that have the same
index key (e.g. free space extents of the same size). If allocation
then falls into this loop, the kernel will just spin searching the
same blocks for something it will never find. Such resource
consumption attacks are trivial to construct but extremely difficult
to detect because they exploit normal behaviour of the structure and
algorithms by mangling trusted pointers.
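
For the sibling-pointer case specifically, one possible check is sketched
below in C. It is an assumption-laden toy, not filesystem code: the sibling
chain is modelled as an array of right-sibling block numbers, DEMO_NULLBLOCK
and both function names are invented, and the loop test is the standard
tortoise-and-hare walk. It covers only this one pointer chain and costs an
extra traversal, so it says nothing about the other structures mentioned
above.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NULLBLOCK UINT32_MAX   /* assumed "no sibling" marker */

/* Follow one right-sibling pointer; an out-of-range block ends the walk. */
static uint32_t demo_step(const uint32_t *right_sib, uint32_t nblocks,
                          uint32_t blk)
{
    if (blk == DEMO_NULLBLOCK || blk >= nblocks)
        return DEMO_NULLBLOCK;
    return right_sib[blk];
}

/*
 * Tortoise-and-hare walk over the sibling chain starting at 'start'.
 * Returns true if the chain loops back on itself, false if it ends.
 */
static bool demo_sibling_chain_loops(const uint32_t *right_sib,
                                     uint32_t nblocks, uint32_t start)
{
    uint32_t slow = start, fast = start;

    for (;;) {
        /* The hare moves two hops, the tortoise one. */
        fast = demo_step(right_sib, nblocks,
                         demo_step(right_sib, nblocks, fast));
        slow = demo_step(right_sib, nblocks, slow);

        if (fast == DEMO_NULLBLOCK || fast >= nblocks)
            return false;   /* chain terminated (or left the valid range) */
        if (fast == slow)
            return true;    /* the two walkers met: the chain is cyclic */
    }
}
```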

Of course, this sort of attack will eventually deadlock the
filesystem because it will back up on locks held by the livelocked
search. Once the filesystem is deadlocked, it can then cause sync()
calls to get stuck on the filesystem. And because sync() is a global
operation, a deadlocked filesystem in one container could cause sync
to hang in a completely unrelated container....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
J. Bruce Fields
2015-07-23 13:19:28 UTC
Post by Dave Chinner
Post by J. Bruce Fields
Post by Austin S Hemmelgarn
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.
Except if the user can write to the filesystem's backing storage (be
it a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the
metadata. So unless the kernel builds the graph internally by
parsing the metadata _and_ has some way to detect that the on-disk
metadata has hit a cycle (which may not just involve 2 items),
Understood. Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.
But that only addresses one type of loop in one specific metadata
structure.
Yep, agreed!
Post by Dave Chinner
There's plenty of other ways you could construct metadata
loops that are essentially undetected and result in either deadlock
or livelock within the filesystem code itself. e.g. just make btree
sibling pointers loop over a range of entries that have the same
index key (e.g. free space extents of the same size). If allocation
then falls into this loop, the kernel will just spin searching the
same blocks for something it will never find. Such resource
consumption attacks are trivial to construct but extremely difficult
to detect because they exploit normal behaviour of the structure and
algorithms by mangling trusted pointers.
Interesting example, thanks! I doubt this particular example would be
*that* hard to detect? But understood that there may be lots of others.
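
For the directory case discussed earlier in the thread, the idea of refusing
an edge that would close a cycle can be sketched in userspace. This is only
an illustration in the spirit of the d_ancestor check, not the dcache code;
struct node, is_ancestor and attach_child are hypothetical names, and the
sketch assumes the in-memory tree is acyclic before each attach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory entry: only the parent link matters. */
struct node {
	struct node *parent;
};

/* Return true if a is a proper ancestor of n (walks parent links upward). */
bool is_ancestor(const struct node *a, const struct node *n)
{
	assert(a != NULL && n != NULL);
	for (const struct node *p = n->parent; p != NULL; p = p->parent)
		if (p == a)
			return true;
	return false;
}

/*
 * Attach child under parent unless that edge would close a cycle, i.e.
 * unless child is parent itself or already one of parent's ancestors.
 */
bool attach_child(struct node *parent, struct node *child)
{
	assert(parent != NULL && child != NULL);
	if (child == parent || is_ancestor(child, parent))
		return false;
	child->parent = parent;
	return true;
}
```

The check costs one walk to the root per attach, and it only covers this one
structure, which is the limitation Dave points out.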

--b.
Post by Dave Chinner
Of course, this sort of attack will eventually deadlock the
filesystem because it will back up on locks held by the livelocked
search. Once the filesystem is deadlocked, it can then cause sync()
calls to get stuck on the filesystem. And because sync() is a global
operation, a deadlocked filesystem in one container could cause sync
to hang in a completely unrelated container....
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2015-07-23 23:48:54 UTC
Permalink
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
Post by Austin S Hemmelgarn
Post by J. Bruce Fields
Post by Dave Chinner
Post by J. Bruce Fields
So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
Therein lies the problem: how do you detect such structural defects
without doing a full structure validation?
You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.
Except if the user can write to the filesystem's backing storage (be
it a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the
metadata. So unless the kernel builds the graph internally by
parsing the metadata _and_ has some way to detect that the on-disk
metadata has hit a cycle (which may not just involve 2 items),
Understood. Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.
But that only addresses one type of loop in one specific metadata
structure.
Yep, agreed!
Post by Dave Chinner
There's plenty of other ways you could construct metadata
loops that are essentially undetected and result in either deadlock
or livelock within the filesystem code itself. e.g. just make btree
sibling pointers loop over a range of entries that have the same
index key (e.g. free space extents of the same size). If allocation
then falls into this loop, the kernel will just spin searching the
same blocks for something it will never find. Such resource
consumption attacks are trivial to construct but extremely difficult
to detect because they exploit normal behaviour of the structure and
algorithms by mangling trusted pointers.
Interesting example, thanks! I doubt this particular example would be
*that* hard to detect?
Yes, it can be detected, but it's not as easy as it sounds because
of abstractions between tree walking and record parsing.
Post by J. Bruce Fields
But understood that there may be lots of others.
Yeah, that's just one of many, many ways I can think of modifying
on disk structures to screw up the kernel.
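
As a rough illustration of the detection side only, a loop in a chain of
sibling pointers can be found with the classic two-pointer (Floyd) walk.
This is a userspace sketch under simplifying assumptions: blocks are array
indices, next[] holds each block's right-sibling index, and NO_SIBLING and
sibling_chain_has_loop are made-up names. It says nothing about how to wire
such a check through the tree-walking and record-parsing layers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_SIBLING ((size_t)-1)	/* hypothetical end-of-chain marker */

/*
 * next[i] is the right-sibling index of block i, or NO_SIBLING.
 * Returns true if following sibling pointers from start never terminates.
 * Out-of-range pointers are treated as the end of the chain here; a real
 * checker would report them as corruption separately.
 */
bool sibling_chain_has_loop(const size_t *next, size_t nblocks, size_t start)
{
	assert(next != NULL);
	size_t slow = start, fast = start;

	while (fast != NO_SIBLING && fast < nblocks) {
		fast = next[fast];		/* fast advances two steps */
		if (fast == NO_SIBLING || fast >= nblocks)
			break;
		fast = next[fast];
		slow = next[slow];		/* slow advances one step */
		if (fast == slow)
			return true;	/* pointers met: the chain loops */
	}
	return false;
}
```

The walk needs no extra memory, but it has to run to completion before the
chain can be trusted, which is a cost the normal lookup path does not pay.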

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Serge E. Hallyn
2015-07-18 00:07:00 UTC
Permalink
Post by Eric W. Biederman
Post by Dave Chinner
Post by Eric W. Biederman
Post by Casey Schaufler
If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them is fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.
So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?
First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.
There are two fundamental issues with mounting filesystems without privilege,
by which I actually mean mounting filesystems as the root user in a user
namespace.
- Are the semantics safe.
- Is the extra attack surface a problem.
I think the attack surface this exposes is the biggest problem
facing this proposal.
I completely agree.
Post by Dave Chinner
Post by Eric W. Biederman
Figuring out how to make semantics safe is what we are talking about.
Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.
Yep. That magnifies the problem quite a bit.
Post by Dave Chinner
User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroys this relationship as
the userns root, by definition, is not a trusted actor.
I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.
Not sure what he said in private, but at the kernel summit last year
what he said was not that it was "hardened", but that any bugs which would
result from mounting a garbage image (i.e. an unpriv user fuzzing)
would be deemed by him a real bug. As opposed to saying "don't do that".

To the best of my knowledge that's so far only the case with Ted/ext4,
which I assume is why Seth started with ext4.

-serge
Colin Walters
2015-07-20 17:54:59 UTC
Permalink
Post by Eric W. Biederman
With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.
There is a large attack surface difference between mounting a device
that someone physically plugged into the computer (and note typically
it's required that the active console be unlocked as well[1]) versus
allowing any "unprivileged" process at any time to do it.

Many server setups use "unprivileged" uids that otherwise wouldn't
be able to exploit bugs in filesystem code.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=653520
"AutomountManager also keeps track of the current session availability
(using the ConsoleKit and gnome-screensaver DBus interfaces) and
inhibits mounting if the current session is locked, or another session
is in use instead."
Casey Schaufler
2015-07-15 23:04:40 UTC
Permalink
Post by Eric W. Biederman
Post by Casey Schaufler
These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).
The patches I'm sending today lay some of the groundwork in the vfs and
1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].
2. Patches 2-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.
The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.
This is not "turning off the option to use labels on files".
This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.
The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is what Seth
has implemented.
In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.
If you ignore Smack labels you get a system that is broken.
Without specifying Smack mount options (requires CAP_MAC_ADMIN)
all your files will be labeled with the floor ("_") label. Unless
you're running with the floor label (Smack systems generally don't)
there won't be anything you can write to. You will be able to read
everything, which is also something you're unlikely to want. Like
I said, broken.
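
To make the floor-label consequence concrete, here is a small userspace
sketch of Smack's built-in access rules as I understand them from the Smack
documentation. It ignores any explicitly loaded rules, and the enum and
function names are invented for the example; it is not the kernel's
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical access kinds for this sketch. */
enum smack_may { SMACK_MAY_READ, SMACK_MAY_WRITE, SMACK_MAY_EXEC };

/*
 * Built-in rules only, checked in order; no explicit rule set is consulted.
 * "*" is the star label, "^" the hat label, "_" the floor label.
 */
bool smack_default_allows(const char *subject, const char *object,
			  enum smack_may req)
{
	assert(subject != NULL && object != NULL);
	bool read_like = (req == SMACK_MAY_READ || req == SMACK_MAY_EXEC);

	if (strcmp(subject, "*") == 0)
		return false;			/* star subject: always denied */
	if (strcmp(subject, "^") == 0 && read_like)
		return true;			/* hat subject may read/execute */
	if (strcmp(object, "_") == 0 && read_like)
		return true;			/* floor objects are readable */
	if (strcmp(object, "*") == 0)
		return true;			/* star objects allow anything */
	if (strcmp(subject, object) == 0)
		return true;			/* same label: allowed */
	return false;				/* everything else is denied */
}
```

With every file defaulting to "_", a task labeled anything other than "_"
can read all of them and write none of them, which is the breakage described
above.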

Personally, I don't believe that the goal of supporting
unprivileged mounts is especially sane. I am willing to
be educated, but I don't see a rational solution.
Seth
Lukasz Pawelczyk
2015-07-16 11:16:44 UTC
Permalink
Post by Eric W. Biederman
I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked
some parts I had issues with. As I recall one of my issues was that
those patches conflicted in detail if not in principle with this
approach.
If these patches do not do a good job of laying the groundwork for
supporting security labels that unprivileged users can set then Seth
could really use some feedback. Figuring out how to properly deal with
the LSMs has been one of his challenges.
I fail to see how those 2 are in any conflict. Smack namespace is just
a means of limiting the view of Smack labels within a user namespace, to
be able to give some limited capabilities to processes in the namespace
to make it possible to partially administer Smack there. It doesn't
change Smack behaviour or mode of operation in any way.

If your approach here is to treat user ns mounted filesystems as if they
didn't support xattrs at all then my patches don't conflict here any
more than Smack itself already does.

If the filesystem will get a default (e.g. by smack* mount options)
label then this label will co-work with Smack namespaces.
--
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics
Eric W. Biederman
2015-07-17 00:10:34 UTC
Permalink
Post by Lukasz Pawelczyk
Post by Eric W. Biederman
I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked
some parts I had issues with. As I recall one of my issues was that
those patches conflicted in detail if not in principle with this
approach.
If these patches do not do a good job of laying the groundwork for
supporting security labels that unprivileged users can set then Seth
could really use some feedback. Figuring out how to properly deal with
the LSMs has been one of his challenges.
I fail to see how those 2 are in any conflict.
Like I said. They don't really conflict, and actually to really support
things well for smack we probably need something like your patches.

At the same time a patch written without dealing with s_user_ns is going
to fail to take a lot of important details into account.

Right now after fixing the mount namespace issues the top priority is to
work through the details and get s_user_ns implemented. By that I mean
some version of patch 1 of Seth's series.

s_user_ns fundamentally changes how the concepts are represented in the
kernel in a way that is easier to secure, and that fundamentally better
matches things. And sigh. This review has shown we don't quite have
all of the details worked out.
Post by Lukasz Pawelczyk
If your approach here is to treat user ns mounted filesystems as if they
didn't support xattrs at all then my patches don't conflict here any
more than Smack itself already does.
The end game if people developing smack choose to play, is to figure out
how to store your unmapped labels in a filesystem contained by a
user namespace and a smack label namespace root.
Post by Lukasz Pawelczyk
If the filesystem will get a default (e.g. by smack* mount options)
label then this label will co-work with Smack namespaces.
A default, but I don't know if it will be smack mount options that will
give that default. The devil is in the details and there are a lot
of details.

Eric
Lukasz Pawelczyk
2015-07-17 10:13:36 UTC
Permalink
Post by Eric W. Biederman
Post by Lukasz Pawelczyk
I fail to see how those 2 are in any conflict.
Like I said. They don't really conflict, and actually to really support
things well for smack we probably need something like your patches.
As far as I can see now from the discussion the best thing to do would
be to inherit the label from a backing store object, or something along
these lines.
Post by Eric W. Biederman
At the same time a patch written without dealing with s_user_ns is going
to fail to take a lot of important details into account.
I don't touch anything that would need to deal with s_user_ns. I also
don't change Smack's mounting logic in any way. My patches are
orthogonal to that.
Post by Eric W. Biederman
Right now after fixing the mount namespace issues the top priority is to
work through the details and get s_user_ns implemented. By that I mean
some version of patch 1 of Seth's series.
My priority is to make Smack namespace work. This is a functionality
that has a perfectly valid use case now. Without it, Smack in a
container is impossible to operate.
Post by Eric W. Biederman
s_user_ns fundamentally changes how the concepts are represented in the
kernel in a way that is easier to secure, and that fundamentally better
matches things. And sigh. This review has shown we don't quite have
all of the details worked out.
Post by Lukasz Pawelczyk
If your approach here is to treat user ns mounted filesystems as if they
didn't support xattrs at all then my patches don't conflict here any
more than Smack itself already does.
The end game if people developing smack choose to play, is to figure out
how to store your unmapped labels in a filesystem contained by a
user namespace and a smack label namespace root.
Storing an unmapped label (read: real label) in Smack namespace is
exactly the same as it is now without the namespace. I always store the
real label.

The problem here is: what real label should be "read" and eventually
stored in that filesystem (see my first comment here). Again, Smack
namespace doesn't touch that logic.
Post by Eric W. Biederman
Post by Lukasz Pawelczyk
If the filesystem will get a default (e.g. by smack* mount options)
label then this label will co-work with Smack namespaces.
A default, but I don't know if it will be smack mount options that will
give that default. The devil is in the details and there are a lot
of details.
Now Smack gives the default. If someone modifies Smack to give a
different label because of s_user_ns support, the Smack namespace will not
cause any hindrance here.

The Smack namespace's main role is only to make it possible to operate Smack
within a container. All the other LSMs can do that already as they don't require
caps to operate normally. Smack does. Hence it had to be namespaced in
some way to give limited capabilities in a container (user ns).

This really has nothing to do with the way Smack mounts, assigns
labels, decides what is allowed and what is not, etc.

What this discussion is about is how to modify or even bend the LSMs' way
of working to make unprivileged user ns mounts work under LSM (or not).
The Smack namespace here is just a utility within Smack itself. And maybe
it can be used to help this at some point, but beyond that it's
orthogonal to the problem.
--
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics
Eric W. Biederman
2015-07-16 03:15:21 UTC
Permalink
Seth I think for the LSMs we should start with:

diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
 int security_sb_mount(const char *dev_name, struct path *path,
 			const char *type, unsigned long flags, void *data)
 {
+	if (current_user_ns() != &init_user_ns)
+		return -EPERM;
 	return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
 }


Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.

The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
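
The staged approach above can be sketched in userspace. This is a simplified
model, not kernel code: the hook type, the two example LSM hooks and
model_security_sb_mount are hypothetical names, and the dispatch is reduced
to "first nonzero result wins". It only illustrates what pushing the check
down into each LSM and then relaxing it one LSM at a time would look like.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one LSM's sb_mount hook. */
typedef int (*sb_mount_hook)(bool in_init_user_ns);

/* An LSM that has not been reviewed yet keeps the blanket denial. */
int unaudited_lsm_sb_mount(bool in_init_user_ns)
{
	return in_init_user_ns ? 0 : -EPERM;
}

/* An LSM whose maintainer has reviewed and relaxed the check. */
int audited_lsm_sb_mount(bool in_init_user_ns)
{
	(void)in_init_user_ns;
	return 0;
}

/* Simplified hook dispatch: the first nonzero return value wins. */
int model_security_sb_mount(const sb_mount_hook *hooks, size_t nhooks,
			    bool in_init_user_ns)
{
	assert(hooks != NULL || nhooks == 0);
	for (size_t i = 0; i < nhooks; i++) {
		int rc = hooks[i](in_init_user_ns);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

In this model a mount from a non-init user namespace succeeds only once
every registered LSM has dropped its denial, so a single unreviewed module
keeps the conservative behaviour.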

Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.

This should remove the need for your patches 5, 6 and 7. For the
immediate future.

Eric
Seth Forshee
2015-07-16 13:59:47 UTC
Permalink
Post by Eric W. Biederman
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
int security_sb_mount(const char *dev_name, struct path *path,
const char *type, unsigned long flags, void *data)
{
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
}
This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.
Post by Eric W. Biederman
Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.
The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.
This should remove the need for your patches 5, 6 and 7. For the
immediate future.
I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.

Seth
Casey Schaufler
2015-07-16 15:09:20 UTC
Permalink
Post by Seth Forshee
Post by Eric W. Biederman
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
int security_sb_mount(const char *dev_name, struct path *path,
const char *type, unsigned long flags, void *data)
{
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
}
This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.
Post by Eric W. Biederman
Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.
The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
Given the extent to which LSMs are deployed I find it a bit
worrisome that they might not be considered a "core issue".
Post by Seth Forshee
Post by Eric W. Biederman
Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.
The "Do the easy stuff, fix the hard stuff after we've sold the product"
approach works really well until you get to the point of fixing the hard
stuff. This is the origin of the 90/90 rule of software development.
Post by Seth Forshee
Post by Eric W. Biederman
This should remove the need for your patches 5, 6 and 7. For the
immediate future.
I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.
This whole notion of mounting a generic filesystem (e.g. ext4) that
is "owned" by a user (as opposed to the system) has lots of implications,
and I seriously doubt that many of them have been accounted for.

Think back to the "negative group access" issue. You can't just
ignore issues that are inconvenient, or claim that you have a reasonable
system just because *you* can't think of a problem.
Post by Seth Forshee
Seth
Seth Forshee
2015-07-16 18:57:50 UTC
Permalink
Post by Casey Schaufler
Post by Seth Forshee
Post by Eric W. Biederman
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
int security_sb_mount(const char *dev_name, struct path *path,
const char *type, unsigned long flags, void *data)
{
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
}
This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.
Post by Eric W. Biederman
Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.
The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
Given the extent to which LSMs are deployed I find it a bit
worrisome that they might not be considered a "core issue".
Post by Seth Forshee
Post by Eric W. Biederman
Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.
The "Do the easy stuff, fix the hard stuff after we've sold the product"
approach works really well until you get to the point of fixing the hard
stuff. This is the origin of the 90/90 rule of software development.
Post by Seth Forshee
Post by Eric W. Biederman
This should remove the need for your patches 5, 6 and 7. For the
immediate future.
I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.
This whole notion of mounting a generic filesystem (e.g. ext4) that
is "owned" by a user (as opposed to the system) has lots of implications,
and I seriously doubt that many of them have been accounted for.
Think back to the "negative group access" issue. You can't just
ignore issues that are inconvenient, or claim that you have a reasonable
system just because *you* can't think of a problem.
I've spent a lot of time considering the implications and previous
vulnerabilities, and I've addressed everything I turned up. Now I'm
asking for review from those with more experience with and expertise of
the code in question. I'm not sure what more I should be doing.

I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.

The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).

I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.

Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.

For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.

Seth
Casey Schaufler
2015-07-16 21:42:22 UTC
Permalink
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Eric W. Biederman
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
int security_sb_mount(const char *dev_name, struct path *path,
const char *type, unsigned long flags, void *data)
{
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
}
This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.
Post by Eric W. Biederman
Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.
The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
Given the extent to which LSMs are deployed I find it a bit
worrisome that they might not be considered a "core issue".
Post by Seth Forshee
Post by Eric W. Biederman
Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.
The "Do the easy stuff, fix the hard stuff after we've sold the product"
approach works really well until you get to the point of fixing the hard
stuff. This is the origin of the 90/90 rule of software development.
Post by Seth Forshee
Post by Eric W. Biederman
This should remove the need for your patches 5, 6 and 7. For the
immediate future.
I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.
This whole notion of mounting a generic filesystem (e.g. ext4) that
is "owned" by a user (as opposed to the system) has lots of implications,
and I seriously doubt that many of them have been accounted for.
Think back to the "negative group access" issue. You can't just
ignore issues that are inconvenient, or claim that you have a reasonable
system just because *you* can't think of a problem.
I've spent a lot of time considering the implications and previous
vulnerabilities, and I've addressed everything I turned up. Now I'm
asking for review from those with more experience with and expertise of
the code in question. I'm not sure what more I should be doing.
Part of the problem I see is that you're looking at the details
when there's an architectural issue. That's OK, it happens all
the time, but we have to pull the issue up slightly higher in
order to address the underlying difficulties.

You want to provide a mechanism whereby an unprivileged user (Seth)
can mount a filesystem for his own use. You want full filesystem
semantics, but you're willing to accept restrictions on certain
filesystem features to avoid opening security holes. You are not
willing to accept restrictions that make the filesystem unusable,
such as making it read-only.

I am going to present a suggestion. Feel free to correct my
assumptions and my reasoning. For simplicity let's use loop-back
mounting of a filesystem contained in a file as an example. The
principles should apply to newly created memory based filesystems
or disk partitions "owned" by Seth.

Seth wants to mount a file (~seth/myfs) which contains an ext4
filesystem. There is already a filesystem object, with security
attributes, that the system knows how to deal with. If Seth mounts
this as a filesystem he, and potentially other people, will be
able to access the content of this object without accessing the
object itself.

seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
seth$ chmod 777 /tmp/seth
seth$ ls -la /tmp/seth
drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
drwxrwxrwt 18 root root 4069 Jul 16 11:13 ..
seth$

Everything's fine at this point. Wilma, who is also using the system and is
the sort who likes to hide things in out of the way places,

wilma$ cp ~/scandals /tmp/seth
wilma$ chmod 600 /tmp/seth/scandals

puts her list of scandals on the unsuspecting filesystem, and changes
the mode to ensure that no one can find out what went on after the
office party.

Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
happened at the office party, and the story goes from there.

Wilma did everything correctly according to the system security policy,
but the system security policy did not protect her as advertised. The
system was tricked into behaving as if it was in control of the content
of the filesystem when in fact it was not.

One way to fix this problem is for unprivileged mounts to recognize the
attributes of the object mounted and to propagate those attributes to all
the objects they present. All files on /tmp/seth would be owned by seth
and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
Opening a file on /tmp/seth would require the same permissions as opening
the file containing the mounted filesystem. These attributes would have to
be immutable, or at least demonstrably more restrictive (chmod might be
allowed in some cases, but chown would never be) when changed. I don't see
how a user other than seth could create a new file, as you'd either have
a magical change in ownership or a false sense of security.
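
A minimal userspace sketch of that propagation rule, under my own
simplifying assumptions: only uid, gid and the nine permission bits are
modeled (no ACLs or LSM labels), the presented mode is taken as the
intersection of the on-disk mode with the backing file's mode, and
struct attrs, present_attrs and chmod_is_tightening are names made up for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical subset of the attributes being propagated. */
struct attrs {
	uid_t uid;
	gid_t gid;
	mode_t mode;
};

/*
 * Compute what the system would present for an object inside the mount:
 * ownership always comes from the backing file, and the permission bits
 * can never be wider than the backing file's. Set-id and sticky bits
 * stored on disk are dropped.
 */
void present_attrs(const struct attrs *backing, const struct attrs *on_disk,
		   struct attrs *out)
{
	assert(backing != NULL && on_disk != NULL && out != NULL);
	out->uid = backing->uid;
	out->gid = backing->gid;
	out->mode = on_disk->mode & backing->mode & 0777;
}

/* A chmod is acceptable only if it removes permissions and adds none. */
bool chmod_is_tightening(mode_t current, mode_t requested)
{
	if (requested & ~(mode_t)0777)
		return false;	/* never grant set-id or sticky bits */
	return (requested & ~current) == 0;
}
```

Under this model Wilma's file would show up as owned by seth from the start,
so she would at least not be told it was private to her.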

I don't see that the presence of user namespaces changes anything. You
may reduce the set of uids available, but the problem with putting a
uid into someone else's file is just as real.
Post by Seth Forshee
I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.
True enough. I hope I've explained myself above.
Post by Seth Forshee
The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Post by Seth Forshee
I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.
Ignoring the security attributes is not safe!
Post by Seth Forshee
Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.
I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.
Post by Seth Forshee
For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.
SMACK64EXEC is analogous to the setuid bit, but I would rather see
exec() of programs with this attribute refused than for it to be
blindly ignored.
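
The difference between the two options can be shown with a small userspace
model. This is only a sketch: the function names are hypothetical, and each
function returns the label the new process would run with, or NULL when the
exec is refused.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * exec_attr is the file's SMACK64EXEC value, or NULL when the attribute
 * is absent. task_label is the label of the process calling exec().
 */

/* Option A: silently ignore SMACK64EXEC on untrusted mounts. */
const char *exec_label_ignore(const char *task_label, const char *exec_attr,
			      bool untrusted_mount)
{
	if (exec_attr != NULL && !untrusted_mount)
		return exec_attr;
	return task_label;
}

/* Option B: refuse to exec a file carrying SMACK64EXEC on untrusted mounts. */
const char *exec_label_refuse(const char *task_label, const char *exec_attr,
			      bool untrusted_mount)
{
	if (exec_attr == NULL)
		return task_label;
	return untrusted_mount ? NULL : exec_attr;
}
```

Under option A the program runs, but with a label its author did not
intend. Under option B the caller gets an error and nothing runs with a
surprising label.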
Post by Seth Forshee
Seth
Casey Schaufler
2015-07-16 23:08:58 UTC
Permalink
Post by Casey Schaufler
You want to provide a mechanism whereby an unprivileged user (Seth)
can mount a filesystem for his own use. You want full filesystem
semantics, but you're willing to accept restrictions on certain
filesystem features to avoid opening security holes. You are not
willing to accept restrictions that make the filesystem unusable,
such as making it read-only.
I am going to present a suggestion. Feel free to correct my
assumptions and my reasoning. For simplicity let's use loop-back
mounting of a filesystem contained in a file as an example. The
principles should apply to newly created memory based filesystems
or disk partitions "owned" by Seth.
Seth wants to mount a file (~seth/myfs) which contains an ext4
filesystem. There is already a filesystem object, with security
attributes, that the system knows how to deal with. If Seth mounts
this as a filesystem he, and potentially other people, will be
able to access the content of this object without accessing the
object itself.
seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
seth$ chmod 777 /tmp/seth
seth$ ls -la /tmp/seth
drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
drwxrwxrwt. 18 root root 4069 Jul 16 11:13 ..
seth$
Everything's fine at this point. Wilma is also using the system,
being the sort who likes to hide things in out of the way places
wilma$ cp ~/scandals /tmp/seth
wilma$ chmod 600 /tmp/seth/scandals
This is already impossible as described. Seth can only mount the
filesystem in a private mount namespace inside a user namespace that
he created. Wilma can't see it unless Seth passes an fd to Wilma and
Wilma accepts and uses it.
But you do have multiple UIDs within your user namespace, right?
There are processes running as someone other than seth, right?
Post by Casey Schaufler
puts her list of scandals on the unsuspecting filesystem, and changes
the mode to ensure that no one can find out what went on after the
office party.
Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
happened at the office party, and the story goes from there.
Wilma did everything correctly according to the system security policy,
but the system security policy did not protect her as advertised. The
system was tricked into behaving as if it was in control of the content
of the filesystem when in fact it was not.
I would argue that, if Wilma writes to some place described by an fd
and doesn't verify where she's writing to, then she has no expectation
of privacy. After all, she could just *tell* Seth directly whatever
she wants (assuming she can communicate with Seth in the first place).
Don't ascribe either wisdom or good intentions to Wilma.
Post by Casey Schaufler
One way to fix this problem is for unprivileged mounts to recognize the
attributes of the object mounted and to propagate those attributes to all
the objects they present. All files on /tmp/seth would be owned by seth
and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
This is impossible to enforce, because Seth could use FUSE instead of ext4.
I never said that things aren't already broken. And, if you want
to ignore the potential DAC issues (read, negative groups) just
do it for the LSM xattrs.
Post by Casey Schaufler
Opening a file on /tmp/seth would require the same permissions as opening
the file containing the mounted filesystem. These attributes would have to
be immutable, or at least demonstrably more restrictive (chmod might be
allowed in some cases, but chown would never be) when changed. I don't see
how a user other than seth could create a new file, as you'd either have
a magical change in ownership or a false sense of security.
This would be a very harsh restriction. Seth might legitimately want
to give a user access to a file on backing store he owns without
giving that user access to the backing store. Root on a normal system
does that all the time.
You already said that it was impossible for Wilma to get
access, so how is this more restrictive? Besides, Seth can
always set the mode on ~/seth so that Wilma can't read the
files it contains. This isn't a new problem or a novel
solution.
Post by Casey Schaufler
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Can you explain what the threat model is here? I don't see what it is
that you're trying to prevent.
Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.
Post by Casey Schaufler
Post by Seth Forshee
Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.
I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.
How do they compromise DAC?
Wilma's expectation (or the application running with a mapped UID)
that chmod will keep Seth out of the file.
--Andy
Casey Schaufler
2015-07-17 00:45:53 UTC
Permalink
Post by Casey Schaufler
Post by Casey Schaufler
You want to provide a mechanism whereby an unprivileged user (Seth)
can mount a filesystem for his own use. You want full filesystem
semantics, but you're willing to accept restrictions on certain
filesystem features to avoid opening security holes. You are not
willing to accept restrictions that make the filesystem unusable,
such as making it read-only.
I am going to present a suggestion. Feel free to correct my
assumptions and my reasoning. For simplicity let's use loop-back
mounting of a filesystem contained in a file as an example. The
principles should apply to newly created memory based filesystems
or disk partitions "owned" by Seth.
Seth wants to mount a file (~seth/myfs) which contains an ext4
filesystem. There is already a filesystem object, with security
attributes, that the system knows how to deal with. If Seth mounts
this as a filesystem he, and potentially other people, will be
able to access the content of this object without accessing the
object itself.
seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
seth$ chmod 777 /tmp/seth
seth$ ls -la /tmp/seth
drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
drwxrwxrwt. 18 root root 4069 Jul 16 11:13 ..
seth$
Everything's fine at this point. Wilma is also using the system,
being the sort who likes to hide things in out of the way places
wilma$ cp ~/scandals /tmp/seth
wilma$ chmod 600 /tmp/seth/scandals
This is already impossible as described. Seth can only mount the
filesystem in a private mount namespace inside a user namespace that
he created. Wilma can't see it unless Seth passes an fd to Wilma and
Wilma accepts and uses it.
But you do have multiple UIDs within your user namespace, right?
There are processes running as someone other than seth, right?
Only if root set it up that way. For example, root could set up
"subuids" (this is a userspace concept) that belong to Seth. These
would be uids that Seth controls and that represent subsets of Seth's
authority. Wilma wouldn't be one of these subuids unless she was
somehow part of Seth (or if root completely screwed up).
Or if root had some really unexpected and inappropriate ideas
on what qualifies as "clever". But I'll back off. It looks like
this particular objection of mine is covered.
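
A rough userspace model of why Wilma is not among Seth's subuids. A uid
map is a list of (inside, outside, count) extents, the same three columns
that /proc/PID/uid_map uses. The function names and the concrete uids in
the usage note are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Ids [inside, inside+count) map to [outside, outside+count). */
struct uid_extent {
	unsigned int inside;
	unsigned int outside;
	unsigned int count;
};

/* Translate a uid seen inside the namespace to the parent's uid, or -1. */
long long uid_to_parent(const struct uid_extent *map, size_t n,
			unsigned int uid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (uid >= map[i].inside && uid - map[i].inside < map[i].count)
			return (long long)map[i].outside + (uid - map[i].inside);
	}
	return -1;
}

/* True if some uid inside the namespace corresponds to parent_uid. */
bool parent_uid_reachable(const struct uid_extent *map, size_t n,
			  unsigned int parent_uid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (parent_uid >= map[i].outside &&
		    parent_uid - map[i].outside < map[i].count)
			return true;
	}
	return false;
}
```

Suppose Seth is uid 1000, Wilma is uid 1001, and root delegated the range
starting at 100000 to Seth. With the map { 0 1000 1 } and
{ 1 100000 65536 }, no uid inside the namespace translates to 1001, so no
process in Seth's namespace runs as Wilma.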
Post by Casey Schaufler
Post by Casey Schaufler
puts her list of scandals on the unsuspecting filesystem, and changes
the mode to ensure that no one can find out what went on after the
office party.
Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
happened at the office party, and the story goes from there.
Wilma did everything correctly according to the system security policy,
but the system security policy did not protect her as advertised. The
system was tricked into behaving as if it was in control of the content
of the filesystem when in fact it was not.
I would argue that, if Wilma writes to some place described by an fd
and doesn't verify where she's writing to, then she has no expectation
of privacy. After all, she could just *tell* Seth directly whatever
she wants (assuming she can communicate with Seth in the first place).
Don't ascribe either wisdom or good intentions to Wilma.
In that case, I'll mention the futility of solving the problem, even
without user namespaces. If Wilma tells Seth something, he's going to
find out. If Wilma pokes it (in whatever form) into an fd provided by
Seth, then Seth is extremely likely to find out, regardless of what
root or the MAC owner tries to do.
I'll buy that, too. I still get queasy every time someone
tells me that passing file descriptors is a security feature.
If Wilma writes to a path that's mounted in her namespace, then, sure,
overall policy associated with her namespace (which, in your example,
is the root namespace) must apply. But Seth can't mount things into
Wilma's namespace without having CAP_SYS_ADMIN in that namespace and,
if he has CAP_SYS_ADMIN, it's already game over.
And so long as it's restricted to the namespace ...
I'm starting to get it now.
Post by Casey Schaufler
Post by Casey Schaufler
One way to fix this problem is for unprivileged mounts to recognize the
attributes of the object mounted and to propagate those attributes to all
the objects they present. All files on /tmp/seth would be owned by seth
and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
This is impossible to enforce, because Seth could use FUSE instead of ext4.
I never said that things aren't already broken. And, if you want
to ignore the potential DAC issues (read, negative groups) just
do it for the LSM xattrs.
Negative groups are a solved problem, I believe.
My position is that there's a workaround but that the
design is still fundamentally flawed.
Post by Casey Schaufler
Post by Casey Schaufler
Opening a file on /tmp/seth would require the same permissions as opening
the file containing the mounted filesystem. These attributes would have to
be immutable, or at least demonstrably more restrictive (chmod might be
allowed in some cases, but chown would never be) when changed. I don't see
how a user other than seth could create a new file, as you'd either have
a magical change in ownership or a false sense of security.
This would be a very harsh restriction. Seth might legitimately want
to give a user access to a file on backing store he owns without
giving that user access to the backing store. Root on a normal system
does that all the time.
You already said that it was impossible for Wilma to get
access, so how is this more restrictive? Besides, Seth can
always set the mode on ~/seth so that Wilma can't read the
files it contains. This isn't a new problem or a novel
solution.
Seth creates a userns to sandbox himself, mounts some FUSE thing in
there, and passes an fd out for the benefit of some daemon. That
daemon had better validate the thing before using it, though.
Point. It won't, but it should.
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.

The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.

Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
Post by Casey Schaufler
Post by Casey Schaufler
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Can you explain what the threat model is here? I don't see what it is
that you're trying to prevent.
Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.
I still don't understand. If it's a filesystem backed by a file that
Seth has RW access to, then Seth can read everything on it, full stop.
The security labels in the filesystem are irrelevant.
Well, they can't be trusted, if that's what you mean.
That's why I'm saying that the objects exposed by mounting
this backing store need to be treated with the same security
attributes as the backing store. Fudge it for DAC if you are
so inclined, but I think it's the right way to go for MAC.
This is like saying that, if you put restrictive labels in the
filesystem that lives on /dev/sda2 and give Seth ownership of
/dev/sda2, then you expect Seth to be unable to bypass the policy
specified by your labels.
Consider the Smack label on /dev/sda2. Smack does not care
who owns it, just what the Smack label is. Just like on
~seth/myfs. The backing store "object" is /dev/sda2 in the
one case, ~seth/myfs in the other, and something in the ether
for a memory based filesystem. So long as the labels of the
files exposed on the mount point match those of the backing
store "object", Smack is going to be happy. Since you're
running without privilege, you can't change the labels on
the files.

Now Seth, being the sneaky person that he is, could change
the Smack labels on the files in the backing store while it's
offline. Since he has access to the backing store, he can't
give himself more access by changing the labels within the
filesystem. He can give himself less, but I'm OK with that.
Or maybe I'm misunderstanding you.
Probably, but I'm undoubtedly doing the same.

If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.
Post by Casey Schaufler
Post by Casey Schaufler
Post by Seth Forshee
Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.
I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.
How do they compromise DAC?
Wilma's expectation (or the application running with a mapped UID)
that chmod will keep Seth out of the file.
That was never true. If Seth has an open fd, Wilma can chmod all day
and it won't matter. In this example, Seth owns the entire filesystem
along with its backing store.
--Andy
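
The point about open file descriptors is easy to check from userspace.
The sketch below is an illustration, not part of the patch set: one
process plays both roles, holding an fd the way Seth would and then
removing every permission bit the way Wilma's chmod would. The function
name and the temporary path are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the number of bytes read through an fd that was opened before
 * the file's mode was changed to 000, or -1 if any step fails.
 */
ssize_t read_after_chmod(void)
{
	char path[] = "/tmp/fd-demo-XXXXXX";
	char buf[16];
	ssize_t n = -1;
	int fd = mkstemp(path);	/* creates the file and opens it read/write */

	if (fd < 0)
		return -1;
	if (write(fd, "secret", 6) == 6 &&
	    chmod(path, 0) == 0 &&		/* drop all permission bits */
	    lseek(fd, 0, SEEK_SET) == 0)
		n = read(fd, buf, sizeof(buf));	/* the existing fd still works */
	close(fd);
	unlink(path);
	return n;
}
```

Permissions are checked when a file is opened, not on each read, so the
read through the existing descriptor still returns the six bytes written
earlier.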
Andy Lutomirski
2015-07-17 00:59:22 UTC
Permalink
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store? IMO this ought to be doable
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem. In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.

Presumably Smack could also figure out what was mounted, but keep in
mind that there are filesystems like ntfs-3g out there. While ntfs-3g
logically has backing store, I don't think the kernel actually knows
about it.
Post by Casey Schaufler
Post by Casey Schaufler
Post by Casey Schaufler
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Can you explain what the threat model is here? I don't see what it is
that you're trying to prevent.
Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.
I still don't understand. If it's a filesystem backed by a file that
Seth has RW access to, then Seth can read everything on it, full stop.
The security labels in the filesystem are irrelevant.
Well, they can't be trusted, if that's what you mean.
That's why I'm saying that the objects exposed by mounting
this backing store need to be treated with the same security
attributes as the backing store. Fudge it for DAC if you are
so inclined, but I think it's the right way to go for MAC.
This is like saying that, if you put restrictive labels in the
filesystem that lives on /dev/sda2 and give Seth ownership of
/dev/sda2, then you expect Seth to be unable to bypass the policy
specified by your labels.
Consider the Smack label on /dev/sda2. Smack does not care
who owns it, just what the Smack label is. Just like on
~seth/myfs. The backing store "object" is /dev/sda2 in the
one case, ~seth/myfs in the other, and something in the ether
for a memory based filesystem. So long as the labels of the
files exposed on the mount point match those of the backing
store "object", Smack is going to be happy. Since you're
running without privilege, you can't change the labels on
the files.
Now Seth, being the sneaky person that he is, could change
the Smack labels on the files in the backing store while it's
offline. Since he has access to the backing store, he can't
give himself more access by changing the labels within the
filesystem. He can give himself less, but I'm OK with that.
Or maybe I'm misunderstanding you.
Probably, but I'm undoubtedly doing the same.
If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.
There's a small but not quite zero chance I'll be there. I'll
probably be in Seoul. It's too bad that LSS and KS are in different
places this year.

--Andy
Serge E. Hallyn
2015-07-17 14:28:32 UTC
Permalink
Post by Andy Lutomirski
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store? IMO this ought to be doable
The more usual LSM-ish way to handle this would be to ask the LSM, at
mount time, with a new security_mount_bdev_in_userns() hook, passing
it the user's label and the backing store's label (if any), and storing
the label to be used for the files. Even more LSM-ish (though risking a
performance hit) would be to then have the LSM at each inode_init_security
decide whether to use that label or trust what's in the fs anyway (or
do something else). That could allow the LSM to use policy to decide
that.

Because I don't know that for all LSMs it makes sense for a 'subject'
label to be assigned to an object.
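
A userspace sketch of the two-stage decision described above, to make the
shape of the proposal concrete. security_mount_bdev_in_userns() is only a
proposed name in this thread, so everything below (the hook type, struct
sb_model, the function names, the example policy) is hypothetical and is
not an existing kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook: returns the label to store, or NULL to refuse. */
typedef const char *(*mount_label_hook)(const char *user_label,
					const char *backing_label);

/* Stand-in for the per-superblock LSM state. */
struct sb_model {
	const char *stored_label;
};

/* One possible policy: prefer the backing store's label if there is one. */
const char *prefer_backing_label(const char *user_label,
				 const char *backing_label)
{
	return backing_label != NULL ? backing_label : user_label;
}

/* Mount time: ask the LSM once and remember its answer. */
int mount_in_userns(struct sb_model *sb, mount_label_hook hook,
		    const char *user_label, const char *backing_label)
{
	const char *label = hook(user_label, backing_label);

	if (label == NULL)
		return -1;	/* the LSM refused the mount */
	sb->stored_label = label;
	return 0;
}

/* Inode setup: policy decides whether the on-disk label is trusted. */
const char *inode_label(const struct sb_model *sb, const char *on_disk_label,
			bool trust_on_disk)
{
	if (trust_on_disk && on_disk_label != NULL)
		return on_disk_label;
	return sb->stored_label;
}
```

Keeping the policy behind a function pointer mirrors the point made above:
an LSM for which a subject label makes no sense on an object can supply a
different hook without changing the callers.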
Post by Andy Lutomirski
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem. In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.
Presumably Smack could also figure out what was mounted, but keep in
mind that there are filesystems like ntfs-3g out there. While ntfs-3g
logically has backing store, I don't think the kernel actually knows
about it.
Post by Casey Schaufler
Post by Casey Schaufler
Post by Casey Schaufler
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Can you explain what the threat model is here? I don't see what it is
that you're trying to prevent.
Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.
I still don't understand. If it's a filesystem backed by a file that
Seth has RW access to, then Seth can read everything on it, full stop.
The security labels in the filesystem are irrelevant.
Well, they can't be trusted, if that's what you mean.
That's why I'm saying that the objects exposed by mounting
this backing store need to be treated with the same security
attributes as the backing store. Fudge it for DAC if you are
so inclined, but I think it's the right way to go for MAC.
This is like saying that, if you put restrictive labels in the
filesystem that lives on /dev/sda2 and give Seth ownership of
/dev/sda2, then you expect Seth to be unable to bypass the policy
specified by your labels.
Consider the Smack label on /dev/sda2. Smack does not care
who owns it, just what the Smack label is. Just like on
~seth/myfs. The backing store "object" is /dev/sda2 in the
one case, ~seth/myfs in the other, and something in the ether
for a memory based filesystem. So long as the labels of the
files exposed on the mount point match those of the backing
store "object", Smack is going to be happy. Since you're
running without privilege, you can't change the labels on
the files.
Now Seth, being the sneaky person that he is, could change
the Smack labels on the files in the backing store while it's
offline. Since he has access to the backing store, he can't
give himself more access by changing the labels within the
filesystem. He can give himself less, but I'm OK with that.
Or maybe I'm misunderstanding you.
Probably, but I'm undoubtedly doing the same.
If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.
There's a small but not quite zero chance I'll be there. I'll
probably be in Seoul. It's too bad that LSS and KS are in different
places this year.
FWIW I'll be there and happy to discuss.

-serge
Seth Forshee
2015-07-17 14:56:57 UTC
Permalink
Post by Serge E. Hallyn
Post by Andy Lutomirski
Post by Casey Schaufler
If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.
There's a small but not quite zero chance I'll be there. I'll
probably be in Seoul. It's too bad that LSS and KS are in different
places this year.
FWIW I'll be there and happy to discuss.
I'll also be in Seattle and happy to discuss.

Seth
Seth Forshee
2015-07-21 20:35:50 UTC
Permalink
Post by Andy Lutomirski
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?

All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate(). Since a user currently needs CAP_MAC_ADMIN in
init_user_ns to store security labels it looks like this should be
sufficient. I'm not even sure that the inode_alloc_security hook changes
are needed.

We could allow privileged users in s_user_ns to write security labels to
disk since they already control the backing store, as long as Smack
didn't subsequently import them. I didn't do that here.
Post by Andy Lutomirski
So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store? IMO this ought to be doable
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem. In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.
It would be very simple to use the label of the task instead.

Seth

---

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
 	__sb_start_write(sb, SB_FREEZE_FS, true);
 }

+static inline bool sb_in_userns(struct super_block *sb)
+{
+	return sb->s_user_ns != &init_user_ns;
+}

 extern bool inode_owner_or_capable(const struct inode *inode);

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
 	char *buffer;
 	struct smack_known *skp = NULL;

+	/* Should never fetch xattrs from untrusted mounts */
+	if (WARN_ON(sb_in_userns(ip->i_sb)))
+		return ERR_PTR(-EPERM);
+
 	if (ip->i_op->getxattr == NULL)
 		return ERR_PTR(-EOPNOTSUPP);

@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		 */
 		if (specified)
 			return -EPERM;
+
 		/*
-		 * Unprivileged mounts get root and default from the caller.
+		 * User namespace mounts get root and default from the backing
+		 * store, if there is one. Other unprivileged mounts get them
+		 * from the caller.
 		 */
-		skp = smk_of_current();
+		skp = (sb_in_userns(sb) && sb->s_bdev) ?
+			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
 	}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
  */
 static int smack_inode_alloc_security(struct inode *inode)
 {
-	struct smack_known *skp = smk_of_current();
+	struct smack_known *skp;
+
+	if (sb_in_userns(inode->i_sb))
+		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+	else
+		skp = smk_of_current();

 	inode->i_security = new_inode_smack(skp);
 	if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			break;
 		}
 		/*
+		 * Don't use labels from xattrs for unprivileged mounts.
+		 */
+		if (sb_in_userns(inode->i_sb))
+			break;
+		/*
 		 * No xattr support means, alas, no SMACK label.
 		 * Use the aforeapplied default.
 		 * It would be curious if the label of the task
Casey Schaufler
2015-07-22 01:52:31 UTC
Permalink
Post by Seth Forshee
Post by Andy Lutomirski
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.

mount -o smackfsroot=Rubble,smackfsdef=Rubble ...

It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.

Smack rule: Flintstone Rubble rwxa

In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.

Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:

mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...

Smack rule: Flintstone Rubble rwxat

Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
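
As a rough illustration of the creation rule described above, here is a small user-space model. It is only a sketch: sketch_may_create() and SKETCH_MAY_TRANSMUTE are hypothetical names that do not exist in the kernel, labels are compared as strings rather than as struct smack_known pointers, and the directory transmute attribute is not modelled at all.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative bit only; the real MAY_TRANSMUTE is defined by Smack. */
#define SKETCH_MAY_TRANSMUTE 0x1000

/*
 * Hypothetical model: may a task labeled `subject` create a file on a
 * user namespace mount whose objects must all carry `mount_label`?
 * `rule_access` is the access mask of the matching Smack rule.
 */
static bool sketch_may_create(const char *subject, const char *mount_label,
			      int rule_access)
{
	/* A Rubble task creates Rubble files, so nothing needs relabeling. */
	if (strcmp(subject, mount_label) == 0)
		return true;

	/* Otherwise the rule must carry "t" so the new file is transmuted. */
	return (rule_access & SKETCH_MAY_TRANSMUTE) != 0;
}
```

With the rule "Flintstone Rubble rwxa" a Flintstone task would be refused under this model, while "Flintstone Rubble rwxat" would let it create files that end up labeled Rubble.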
Post by Seth Forshee
Since a user currently needs CAP_MAC_ADMIN in
init_user_ns to store security labels it looks like this should be
sufficient. I'm not even sure that the inode_alloc_security hook changes
are needed.
We could allow privileged users in s_user_ns to write security labels to
disk since they already control the backing store, as long as Smack
didn't subsequently import them. I didn't do that here.
Post by Andy Lutomirski
So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store? IMO this ought to be doable
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem. In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.
It would be very simple to use the label of the task instead.
Seth
---
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like

... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
Post by Seth Forshee
inode->i_security = new_inode_smack(skp);
if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
break;
}
/*
+ * Don't use labels from xattrs for unprivileged mounts.
+ */
+ if (sb_in_userns(inode->i_sb))
+ break;
+ /*
Again, use the label. Just check to make sure it's what you expect.
Post by Seth Forshee
* No xattr support means, alas, no SMACK label.
* Use the aforeapplied default.
* It would be curious if the label of the task
Also untested.
Seth Forshee
2015-07-22 15:56:34 UTC
Permalink
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.
mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.
Smack rule: Flintstone Rubble rwxa
In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.
Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:
mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
Smack rule: Flintstone Rubble rwxat
Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
Post by Casey Schaufler
Post by Seth Forshee
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
I assume that you meant skp and not sp here.
Post by Casey Schaufler
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like
... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
I'm not getting how this covers all cases.

So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.

What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.

Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
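
To make the bit arithmetic in this aside concrete, here is a small user-space sketch. It models only the final "MAY_WRITE implies MAY_LOCK" step as described above, not the rule list walk. The SKETCH_* values are assumptions meant to mirror the kernel's MAY_WRITE and Smack's MAY_LOCK, and both function names are hypothetical.

```c
#include <errno.h>

/* Assumed values, mirroring MAY_WRITE and Smack's MAY_LOCK. */
#define SKETCH_MAY_WRITE 0x00000002
#define SKETCH_MAY_LOCK  0x00002000

/* Models the unguarded fixup: applied even when `may` holds an errno. */
static int sketch_fixup(int may)
{
	if ((may & SKETCH_MAY_WRITE) == SKETCH_MAY_WRITE)
		may |= SKETCH_MAY_LOCK;
	return may;
}

/* Variant with the suggested guard: negative error values are skipped. */
static int sketch_fixup_guarded(int may)
{
	if (may > 0 && (may & SKETCH_MAY_WRITE) == SKETCH_MAY_WRITE)
		may |= SKETCH_MAY_LOCK;
	return may;
}
```

On a two's complement machine -ENOENT is -2, which has every bit set except bit 0. The write test therefore passes, and OR-ing in the lock bit changes nothing, so both variants hand back -ENOENT unchanged. The guard only makes the intent explicit.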
Post by Casey Schaufler
Post by Seth Forshee
inode->i_security = new_inode_smack(skp);
if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
break;
}
/*
+ * Don't use labels from xattrs for unprivileged mounts.
+ */
+ if (sb_in_userns(inode->i_sb))
+ break;
+ /*
Again, use the label. Just check to make sure it's what you expect.
What happens if it's not what I expect? smack_d_instantiate cannot fail
... so just use the default label? In that case why bother reading it at
all? Or would we actually want to change the on-disk label if it didn't
match?
Post by Casey Schaufler
Post by Seth Forshee
* No xattr support means, alas, no SMACK label.
* Use the aforeapplied default.
* It would be curious if the label of the task
Also untested.
Casey Schaufler
2015-07-22 18:10:46 UTC
Permalink
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.
mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.
Smack rule: Flintstone Rubble rwxa
In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.
Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:
mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
Smack rule: Flintstone Rubble rwxat
Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
Post by Casey Schaufler
Post by Seth Forshee
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
I assume that you meant skp and not sp here.
Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
in the smk_flags field of the root inode. That's easy:

transmute = 1;

and the code after "Initialize the root inode" will take care of it.
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like
... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
I'm not getting how this covers all cases.
So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.
What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.
Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.

If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.

static int smack_inode_permission(struct inode *inode, int mask)
{
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;

mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
/*
* No permission to check. Existence test. Yup, it's there.
*/
if (mask == 0)
return 0;

+ if (sb_in_userns(inode->i_sb) &&
+ smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+ return -EACCES;
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
smk_ad_setfield_u_fs_inode(&ad, inode);
rc = smk_curacc(smk_of_inode(inode), mask, &ad);
rc = smk_bu_inode(inode, mask, rc);
return rc;
}
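
The check proposed above dereferences s_bdev unconditionally, while the earlier mount-time hunk tests sb->s_bdev before using it. The following user-space sketch shows one way the comparison could tolerate a mount without a backing device. Everything here is hypothetical: the struct and function names are stand-ins, labels are plain strings instead of struct smack_known pointers, and falling through to the normal check when there is no backing store is an assumption, not something decided in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures named in the thread. */
struct sketch_sb {
	bool in_userns;			/* models sb_in_userns(sb) */
	const char *backing_label;	/* NULL models sb->s_bdev == NULL */
};

struct sketch_inode {
	const struct sketch_sb *sb;
	const char *label;		/* models smk_of_inode(inode) */
};

/* True when access should be refused before the usual rule lookup. */
static bool sketch_label_mismatch(const struct sketch_inode *inode)
{
	const struct sketch_sb *sb = inode->sb;

	/* Init-namespace mounts and mounts without backing store: no extra rule. */
	if (!sb->in_userns || sb->backing_label == NULL)
		return false;

	return strcmp(inode->label, sb->backing_label) != 0;
}
```

In this model a Slate inode on a Rubble-backed user namespace mount is rejected, while the same inode on an init-namespace mount goes on to the ordinary access check.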
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
inode->i_security = new_inode_smack(skp);
if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
break;
}
/*
+ * Don't use labels from xattrs for unprivileged mounts.
+ */
+ if (sb_in_userns(inode->i_sb))
+ break;
+ /*
Again, use the label. Just check to make sure it's what you expect.
What happens if it's not what I expect? smack_d_instantiate cannot fail
... so just use the default label? In that case why bother reading it at
all? Or would we actually want to change the on-disk label if it didn't
match?
Post by Casey Schaufler
Post by Seth Forshee
* No xattr support means, alas, no SMACK label.
* Use the aforeapplied default.
* It would be curious if the label of the task
Also untested.
Seth Forshee
2015-07-22 19:32:23 UTC
Permalink
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.
mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.
Smack rule: Flintstone Rubble rwxa
In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.
Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:
mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
Smack rule: Flintstone Rubble rwxat
Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
Post by Casey Schaufler
Post by Seth Forshee
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
I assume that you meant skp and not sp here.
Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
in the smk_flags field of the root inode. That's easy:
transmute = 1;
and the code after "Initialize the root inode" will take care of it.
Yeah, that's what I've actually done.
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like
... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
I'm not getting how this covers all cases.
So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.
What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.
Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.
If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
Yes.
Post by Casey Schaufler
static int smack_inode_permission(struct inode *inode, int mask)
{
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;
mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
/*
* No permission to check. Existence test. Yup, it's there.
*/
if (mask == 0)
return 0;
+ if (sb_in_userns(inode->i_sb) &&
+ smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+ return -EACCES;
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
smk_ad_setfield_u_fs_inode(&ad, inode);
rc = smk_curacc(smk_of_inode(inode), mask, &ad);
rc = smk_bu_inode(inode, mask, rc);
return rc;
}
Hmm, okay. I think I've been a little confused all this time about how
you want to handle these unprivileged mounts.

Originally I thought you wanted all objects in the filesystem to get the
same label as the backing store. That's what I tried to implement
originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
assign every object (new and existing) smk_default and completely ignore
the labels on disk.

This is what I currently think you want for user ns mounts:

1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
3. For existing files:
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
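
Steps 1 and 2 of this summary can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct sketch_mount_labels and sketch_userns_mount() are hypothetical names, labels are plain strings, and the fallback to the mounting task's label when there is no backing device follows the comment in the smack_sb_kern_mount hunk earlier in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-superblock Smack state. */
struct sketch_mount_labels {
	const char *root;	/* models sp->smk_root */
	const char *def;	/* models sp->smk_default */
	bool root_transmute;	/* models the transmute property on s_root */
};

/*
 * Pick the labels for a user namespace mount: the backing device's label
 * if there is one (backing_label != NULL), else the mounting task's label.
 */
static struct sketch_mount_labels
sketch_userns_mount(const char *backing_label, const char *task_label)
{
	const char *label = backing_label ? backing_label : task_label;
	struct sketch_mount_labels m = { label, label, true };

	return m;
}
```

Step 3 is then a plain comparison of each existing inode's label against the root label chosen here.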

If this is right, there are a couple lingering questions in my mind.

First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.

The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems they are, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.

So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.

Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.

Thanks,
Seth
Casey Schaufler
2015-07-23 00:05:17 UTC
Permalink
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.
mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.
Smack rule: Flintstone Rubble rwxa
In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.
Let's make a minor change. Instead of using smackfsroot let's
mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
Smack rule: Flintstone Rubble rwxat
Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
Post by Casey Schaufler
Post by Seth Forshee
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
I assume that you meant skp and not sp here.
Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
transmute = 1;
and the code after "Initialize the root inode" will take care of it.
Yeah, that's what I've actually done.
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like
... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
I'm not getting how this covers all cases.
So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.
What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.
Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
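
The aside about -ENOENT can be checked with a small user-space sketch. It assumes the fixup in question is the "write implies lock" step and that the flag values below match the kernel headers of that era; both are assumptions, and the names (MODEL_MAY_WRITE, MODEL_MAY_LOCK, fixup_unguarded, fixup_guarded) are hypothetical.

```c
#include <errno.h>

/* Assumed flag values, for illustration only. */
#define MODEL_MAY_WRITE 0x00000002
#define MODEL_MAY_LOCK  0x00002000

/* Models the fixup as described: applied whether or not a rule was found. */
int fixup_unguarded(int may)
{
	if ((may & MODEL_MAY_WRITE) == MODEL_MAY_WRITE)
		may |= MODEL_MAY_LOCK;
	return may;
}

/* Variant with the suggested may > 0 guard. */
int fixup_guarded(int may)
{
	if (may > 0 && (may & MODEL_MAY_WRITE) == MODEL_MAY_WRITE)
		may |= MODEL_MAY_LOCK;
	return may;
}
```

On a two's complement machine -ENOENT has every bit set except bit 0, so the unguarded version tests true for the write bit and then ORs in a bit that is already set. Both variants return -ENOENT unchanged, which is why the guard only buys clarity.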
My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.
If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
Yes.
Post by Casey Schaufler
static int smack_inode_permission(struct inode *inode, int mask)
{
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;
mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
/*
* No permission to check. Existence test. Yup, it's there.
*/
if (mask == 0)
return 0;
+ if (sb_in_userns(inode->i_sb) &&
+ smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+ return -EACCES;
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
smk_ad_setfield_u_fs_inode(&ad, inode);
rc = smk_curacc(smk_of_inode(inode), mask, &ad);
rc = smk_bu_inode(inode, mask, rc);
return rc;
}
Hmm, okay. I think I've been a little confused all this time about how
you want to handle these unprivileged mounts.
Not your problem. I'm not the most consistent of reviewers.
Post by Seth Forshee
Originally I thought you wanted all objects in the filesystem to get the
same label as the backing store. That's what I tried to implement
originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
assign every object (new and existing) smk_default and completely ignore
the labels on disk.
I want everything to have the label of the backing store, but
I don't want to ignore it if it somehow got something else. Because
the only legitimate label for this example is Rubble, I want to
reject anything else that appears. If someone builds a filesystem
by hand with Slate labels I want it treated "safely".
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems they are, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
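
The effect of skipping the SMACK64EXEC fetch can be modeled roughly as follows. This is a user-space sketch, not the kernel change itself: model_exec_label and its parameters are hypothetical names, and labels are plain strings.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: pick the label a task runs with after exec.
 * exec_xattr is the SMACK64EXEC value found on disk, or NULL if absent.
 */
const char *model_exec_label(bool sb_in_userns, const char *exec_xattr,
			     const char *current_label)
{
	/* On a user namespace mount the on-disk exec label is never honored. */
	if (sb_in_userns || exec_xattr == NULL)
		return current_label;

	/* Trusted mount: the exec label from disk applies. */
	return exec_xattr;
}
```

Because the xattr is never consulted for a user namespace mount, a hand-built filesystem cannot use SMACK64EXEC to hand a process a label it did not already have.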
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
Post by Seth Forshee
Thanks,
Seth
Eric W. Biederman
2015-07-23 00:15:19 UTC
Permalink
Post by Casey Schaufler
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
That actually sounds very reasonable to me. It is essentially what we
do with uids and gids already. I presume the Smack namespace support,
when integrated with all of this, would allow a set of labels to be
set.

Have I missed a part of the conversation where you talk about filesystems
that don't have support for storing labels? Filesystems like vfat, isofs,
etc.

Eric
Seth Forshee
2015-07-23 05:15:28 UTC
Permalink
Post by Eric W. Biederman
That actually sounds very reasonable to me. It is essentially what we
do with uids and gids already. I presume the Smack namespace support,
when integrated with all of this, would allow a set of labels to be
set.
Have I missed a part of the conversation where you talk about filesystems
that don't have support for storing labels? Filesystems like vfat, isofs,
etc.
As I read the code they should all end up with the superblock's
smk_default label for the objects in RAM, i.e. the label of the backing
store. The same would be true for existing files in a filesystem which
does support storing labels but has no labels on the files.
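
The defaulting described here can be sketched in userspace C. This is only a model, not kernel code: `struct label`, `struct sb_model`, and `effective_label()` are hypothetical names invented for the illustration, and the one behavior shown is that an object with no stored label ends up with the superblock's default label.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's Smack structures. */
struct label {
	const char *name;
};

struct sb_model {
	const struct label *smk_default; /* used when nothing is stored */
};

/*
 * Pick the in-memory label for an object. A filesystem that cannot
 * store labels (or a file that simply has none) passes on_disk == NULL
 * and therefore gets the superblock default.
 */
static const struct label *effective_label(const struct sb_model *sb,
					   const struct label *on_disk)
{
	assert(sb != NULL && sb->smk_default != NULL);
	return on_disk != NULL ? on_disk : sb->smk_default;
}
```

With a superblock whose default is Rubble, a vfat-style object with no stored label comes back as Rubble, while an object carrying a stored Slate label keeps Slate in this model.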

Seth
Casey Schaufler
2015-07-23 21:48:40 UTC
Post by Eric W. Biederman
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.
Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.
The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.
Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
So something like the diff below (untested)?
I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.
Post by Seth Forshee
All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate().
Let's say your backing store is a file labeled Rubble.
mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.
Smack rule: Flintstone Rubble rwxa
In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.
Let's make a minor change. Instead of using smackfsroot let's
mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
Smack rule: Flintstone Rubble rwxat
Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
Post by Casey Schaufler
Post by Seth Forshee
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}
+static inline bool sb_in_userns(struct super_block *sb)
+{
+ return sb->s_user_ns != &init_user_ns;
+}
extern bool inode_owner_or_capable(const struct inode *inode);
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ /* Should never fetch xattrs from untrusted mounts */
+ if (WARN_ON(sb_in_userns(ip->i_sb)))
+ return ERR_PTR(-EPERM);
+
Go ahead and fetch it, we'll check to make sure it's viable later.
Post by Seth Forshee
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
*/
if (specified)
return -EPERM;
+
/*
- * Unprivileged mounts get root and default from the caller.
+ * User namespace mounts get root and default from the backing
+ * store, if there is one. Other unprivileged mounts get them
+ * from the caller.
*/
- skp = smk_of_current();
+ skp = (sb_in_userns(sb) && sb->s_bdev) ?
+ smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
sp->smk_flags |= SMK_INODE_TRANSMUTE;
I assume that you meant skp and not sp here.
Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
transmute = 1;
and the code after "Initialize the root inode" will take care of it.
Yeah, that's what I've actually done.
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
*/
static int smack_inode_alloc_security(struct inode *inode)
{
- struct smack_known *skp = smk_of_current();
+ struct smack_known *skp;
+
+ if (sb_in_userns(inode->i_sb))
+ skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+ else
+ skp = smk_of_current();
This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like
... after the call may = smk_access_entry(...)
if (sb_in_userns(inode->i_sb))
if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
return -EACCES;
I'm not getting how this covers all cases.
So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.
What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.
Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.
If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
Yes.
Post by Casey Schaufler
static int smack_inode_permission(struct inode *inode, int mask)
{
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;
mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
/*
* No permission to check. Existence test. Yup, it's there.
*/
if (mask == 0)
return 0;
+ if (sb_in_userns(inode->i_sb) &&
+ smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+ return -EACCES;
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
smk_ad_setfield_u_fs_inode(&ad, inode);
rc = smk_curacc(smk_of_inode(inode), mask, &ad);
rc = smk_bu_inode(inode, mask, rc);
return rc;
}
Hmm, okay. I think I've been a little confused all this time about how
you want to handle these unprivileged mounts.
Not your problem. I'm not the most consistent of reviewers.
Post by Seth Forshee
Originally I thought you wanted all objects in the filesystem to get the
same label as the backing store. That's what I tried to implement
originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
assign every object (new and existing) smk_default and completely ignore
the labels on disk.
I want everything to have the label of the backing store, but
I don't want to ignore it if it somehow got something else. Because
the only legitimate label for this example is Rubble, I want to
reject anything else that appears. If someone builds a filesystem
by hand with Slate labels I want it treated "safely".
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
That actually sounds very reasonable to me. It is essentially what we
do with uids and gids already. I presume the Smack namespace support,
when integrated with all of this, would allow a set of labels to be
set.
Have I missed a part of the conversation where you talk about filesystems
that don't have support for storing labels? Filesystems like vfat, isofs,
etc.
They are easier. Set smackfsroot=Rubble,smackfsdef=Rubble and all objects
there will get labeled Rubble. Processes with different labels that can
write there will end up creating Rubble objects. For privileged mounts you
can set the values at will. For unprivileged mounts, you should take the
label values from the backing store.
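
A rough userspace sketch of the unprivileged case, for illustration only. `unpriv_sb_label()` and its parameters are hypothetical; the behavior modelled is taken from the smack_sb_kern_mount() hunk quoted above: explicit Smack mount options are refused with -EPERM, and otherwise the label comes from the backing store when one is known, else from the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Choose the root/default label for an unprivileged mount.
 *
 * specified     - true if smackfsroot=, smackfsdef= etc. were passed
 * backing_label - label of the backing store, or NULL if unknown
 * caller_label  - label of the process calling mount
 * out           - receives the chosen label on success
 *
 * Returns 0 on success or -EPERM if options were specified.
 */
static int unpriv_sb_label(bool specified, const char *backing_label,
			   const char *caller_label, const char **out)
{
	assert(caller_label != NULL && out != NULL);

	/* Unprivileged callers may not choose labels themselves. */
	if (specified)
		return -EPERM;

	*out = backing_label != NULL ? backing_label : caller_label;
	return 0;
}
```

The privileged path, where the values can be set at will, is deliberately left out of the sketch.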
Post by Eric W. Biederman
Eric
Seth Forshee
2015-07-28 20:40:09 UTC
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?

What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
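
For readers who want the intended behavior in isolation, here is a small userspace model of the early check that the patch which follows adds to smack_inode_permission(). The struct and function names are hypothetical stand-ins, labels are compared by pointer as the kernel compares smack_known pointers, and the normal Smack rule check that would run afterwards is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for smack_known, superblock and inode state. */
struct smack_label {
	const char *name;
};

struct sb_model {
	bool in_userns;                      /* s_user_ns != &init_user_ns */
	const struct smack_label *smk_root;  /* label chosen at mount time */
};

struct inode_model {
	const struct sb_model *sb;
	const struct smack_label *label;
};

/*
 * Returns 0 if the request may continue to the regular Smack access
 * check, or -EACCES if the inode's label is not the one assigned to a
 * user namespace mount.
 */
static int userns_label_check(const struct inode_model *inode, int mask)
{
	assert(inode != NULL && inode->sb != NULL);

	/* No permission to check: existence test only. */
	if (mask == 0)
		return 0;

	if (inode->sb->in_userns && inode->label != inode->sb->smk_root)
		return -EACCES;

	return 0;
}
```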

---

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
skp = smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
+ if (sb_in_userns(sb))
+ transmute = 1;
}
/*
* Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
if (mask == 0)
return 0;

+ if (sb_in_userns(inode->i_sb)) {
+ struct superblock_smack *sbsp = inode->i_sb->s_security;
+ if (smk_of_inode(inode) != sbsp->smk_root)
+ return -EACCES;
+ }
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- /*
- * Don't let the exec or mmap label be "*" or "@".
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
+ if (!sb_in_userns(inode->i_sb)) {
+ /*
+ * Don't let the exec or mmap label be "*" or "@".
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
+ }

skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
Casey Schaufler
2015-07-30 16:18:16 UTC
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.
Post by Seth Forshee
---
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
skp = smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
+ if (sb_in_userns(sb))
+ transmute = 1;
}
/*
* Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
if (mask == 0)
return 0;
+ if (sb_in_userns(inode->i_sb)) {
+ struct superblock_smack *sbsp = inode->i_sb->s_security;
+ if (smk_of_inode(inode) != sbsp->smk_root)
+ return -EACCES;
+ }
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- /*
- * Don't let the exec or mmap label be "*" or "@".
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
+ if (!sb_in_userns(inode->i_sb)) {
+ /*
+ * Don't let the exec or mmap label be "*" or "@".
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
+ }
skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
Eric W. Biederman
2015-07-30 17:05:27 UTC
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.
do_new_mount
  vfs_kern_mount
    mount_fs
      ...
        mount_bdev
          blkdev_get_by_path(...,FMODE_READ | FMODE_WRITE | FMODE_EXCL,...)
            lookup_bdev
              kern_path
                filename_lookup
                  path_lookupat
                    lookup_last
                      walk_component
            blkdev_get(...,mode,...)
              __blkdev_get(...,mode,...)
                devcgroup_inode_permission(bdev->bd_inode, perm)

*scratches my head*

It looks like we don't actually check the permissions on the block
device. Tomoyo has a hack for it. nfsd does something. There is
devcgroup silliness.

But overall it looks like we depend on capable(CAP_SYS_ADMIN).

Seth I do believe we have found another area of the vfs we will need to
shore up before allowing unprivileged mounts of block device based
filesystems.

It looks like there are enough hacks that having someone with a clue come
through and make the code make more sense seems like a good idea anyway.
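
To make the gap concrete, here is a userspace sketch of the kind of check that appears to be missing: before the device is opened read/write on the caller's behalf, confirm the caller could open it that way. Everything in it is hypothetical (`struct dev_node`, `dev_permission()`, the WANT_* flags), group permissions are ignored for brevity, and it is one possible shape for such a check rather than the fix that was later written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a block device node's ownership and mode. */
struct dev_node {
	unsigned int owner_uid;
	unsigned int mode;       /* permission bits, e.g. 0600 */
};

#define WANT_READ  4u
#define WANT_WRITE 2u

/*
 * Simplified owner/other permission check (group bits are ignored in
 * this sketch). Returns 0 if uid holds every bit in want, else -EACCES.
 */
static int dev_permission(const struct dev_node *dev, unsigned int uid,
			  unsigned int want)
{
	unsigned int granted;

	assert(dev != NULL);

	if (uid == dev->owner_uid)
		granted = (dev->mode >> 6) & 7u;
	else
		granted = dev->mode & 7u;

	return (granted & want) == want ? 0 : -EACCES;
}
```

In this model a mount path would call dev_permission(dev, uid, WANT_READ | WANT_WRITE) before opening the device and fail the mount on -EACCES, instead of relying on a capability check alone.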

Eric
Seth Forshee
2015-07-30 17:25:17 UTC
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.
do_new_mount
vfs_kern_mount
mount_fs
...
mount_bdev
blkdev_get_by_path(...,FMODE_READ| FMODE_WRITE | FMODE_EXCL,...)
lookup_bdev
kern_path
filename_lookup
path_lookupat
lookup_last
walk_component
blkdev_get(...,mode,...)
__blkdev_get(...,mode,...)
devcgroup_inode_permission(bdev->bd_inode, perm)
*scratches my head*
It looks like we don't actually check the permissions on the block
device. Tomoyo has a hack for it. nfsd does something. There is
devcgroup silliness.
But overall it looks like we depend on capable(CAP_SYS_ADMIN).
Seth I do believe we have found another area of the vfs we will need to
shore up before allowing unprivileged mounts of block device based
filesystems.
It looks like there are enough hacks that having someone with a clue come
through and make the code make more sense seems like a good idea anyway.
Yep, I just came to the same conclusion myself, and I also verified the
behavior empirically. That's definitely a problem. I'll get to work on
fixing that.

Seth
Eric W. Biederman
2015-07-30 17:33:57 UTC
Post by Seth Forshee
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.
do_new_mount
vfs_kern_mount
mount_fs
...
mount_bdev
blkdev_get_by_path(...,FMODE_READ| FMODE_WRITE | FMODE_EXCL,...)
lookup_bdev
kern_path
filename_lookup
path_lookupat
lookup_last
walk_component
blkdev_get(...,mode,...)
__blkdev_get(...,mode,...)
devcgroup_inode_permission(bdev->bd_inode, perm)
*scratches my head*
It looks like we don't actually check the permissions on the block
device. Tomoyo has a hack for it. nfsd does something. There is
devcgroup silliness.
But overall it looks like we depend on capable(CAP_SYS_ADMIN).
Seth I do believe we have found another area of the vfs we will need to
shore up before allowing unprivileged mounts of block device based
filesystems.
It looks like there are enough hacks that having someone with a clue come
through and make the code make more sense seems like a good idea anyway.
Yep, I just came to the same conclusion myself, and I also verified the
behavior empirically. That's definitely a problem. I'll get to work on
fixing that.
At a quick glance it looks like lookup_bdev and most of its callers
need to be modified to potentially do the additional permission
checking.

I expect we could move the devcgroup checks into whatever new checks we
wind up adding.

Fun, fun fun.

Eric
Seth Forshee
2015-07-17 13:21:46 UTC
On Thu, Jul 16, 2015 at 02:42:22PM -0700, Casey Schaufler wrote:

<snip>
Post by Casey Schaufler
Post by Seth Forshee
I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.
True enough. I hope I've explained myself above.
Thanks, that definitely clarified where we were having a disconnect.
Andy's done a fantastic job explaining how those concerns are addressed.
Post by Casey Schaufler
Post by Seth Forshee
The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Post by Seth Forshee
I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.
Ignoring the security attributes is not safe!
Understood. It's surely safe for each LSM to deny such mounts until it
has a way to handle them safely though.

I'm not trying to completely punt on the issue of security modules, just
break this down into more manageable chunks. You've given good guidance
for Smack (thanks very much for that), so I can plan to work on that
soon.
Post by Casey Schaufler
Post by Seth Forshee
Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.
I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.
It looks like Andy's more or less convinced you that DAC isn't
(additionally?) compromised. And there's a plan for MAC, that the
security module can deny mounts from user namespaces until it has a
solution for allowing them safely.
Post by Casey Schaufler
Post by Seth Forshee
For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.
SMACK64EXEC is analogous to the setuid bit, but I would rather see
exec() of programs with this attribute refused than for it to be
blindly ignored.
That's fine, it's your call.

Thanks,
Seth
Casey Schaufler
2015-07-17 17:14:09 UTC
Permalink
Post by Seth Forshee
<snip>
Post by Casey Schaufler
Post by Seth Forshee
I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.
True enough. I hope I've explained myself above.
Thanks, that definitely clarified where we were having a disconnect.
Andy's done a fantastic job explaining how those concerns are addressed.
Post by Casey Schaufler
Post by Seth Forshee
The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).
If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.
Post by Seth Forshee
I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.
Ignoring the security attributes is not safe!
Understood. It's surely safe for each LSM to deny such mounts until it
has a way to handle them safely though.
I'm not trying to completely punt on the issue of security modules, just
break this down into more manageable chunks. You've given good guidance
for Smack (thanks very much for that), so I can plan to work on that
soon.
Post by Casey Schaufler
Post by Seth Forshee
Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.
I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.
It looks like Andy's more or less convinced you that DAC isn't
(additionally?) compromised. And there's a plan for MAC, that the
security module can deny mounts from user namespaces until it has a
solution for allowing them safely.
I wouldn't say that Andy has me convinced on DAC. I would say that
he's taken me deeper into the details of namespaces than I feel
comfortable making arguments about. I don't know that he's right,
I just don't know how to argue that he isn't. Part of what bothers
me is the dependence on namespaces. If you could come up with a
mechanism that wasn't dependent on namespaces it would be much
easier for dinosaurs like me to comprehend.

As far as declaring that MAC and namespace owned mounts are
incompatible goes, I think that I said early on that wasn't
going to fly. Too much of the Linux population (Fedora, Android,
Tizen, ...) uses MAC for the feature to be considered ready
for general consumption without it. And no, I don't believe in
partial implementations. You wouldn't get away with putting this
in if it only worked on s370 processors.
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.
SMACK64EXEC is analogous to the setuid bit, but I would rather see
exec() of programs with this attribute refused than for it to be
blindly ignored.
That's fine, it's your call.
I said it, but on reflection the current NOSETUID behavior is
as you described it, so I wouldn't change that.
Post by Seth Forshee
Thanks,
Seth
Seth Forshee
2015-07-16 15:59:05 UTC
Permalink
Post by Seth Forshee
Post by Eric W. Biederman
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
int security_sb_mount(const char *dev_name, struct path *path,
const char *type, unsigned long flags, void *data)
{
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
}
This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.
What might work instead is to add a check in security_sb_kern_mount.
Then it would need to check s_user_ns, that way if proc, sysfs, etc.
use sget_userns(..., &init_user_ns) they can still be mounted in
containers.

It would be nicer to have a hook after sget but before fill_super so
that a bunch of work doesn't have to be done and then undone. Right now
there doesn't seem to be any suitable hook.
Post by Seth Forshee
Post by Eric W. Biederman
Then we should push this down into all of the lsms.
Then we should remove, relax, or change the check as appropriate
in each lsm.
The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.
Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.
This should remove the need for your patches 5, 6 and 7 for the
immediate future.
I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.
Seth
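The difference between the two checks discussed above can be sketched in a small user-space model. This is only an illustration under stated assumptions: the `model_*` structs and functions are hypothetical stand-ins, not kernel APIs, and `init_ns` plays the role of `&init_user_ns`. The first function mirrors a blanket test on the caller's namespace; the second mirrors a test on the superblock's `s_user_ns`, which still permits a superblock that was created on behalf of the initial namespace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct user_namespace and struct super_block. */
struct model_user_ns { int unused; };
struct model_sb { const struct model_user_ns *s_user_ns; };

/* Blanket check: decides purely on the namespace of the calling task. */
static int model_check_caller(const struct model_user_ns *current_ns,
			      const struct model_user_ns *init_ns)
{
	assert(init_ns != NULL);
	return current_ns == init_ns ? 0 : -EPERM;
}

/* Per-superblock check: decides on which namespace owns the superblock. */
static int model_check_sb(const struct model_sb *sb,
			  const struct model_user_ns *init_ns)
{
	assert(sb != NULL && init_ns != NULL);
	return sb->s_user_ns == init_ns ? 0 : -EPERM;
}
```

In this model a caller inside a container is always refused by `model_check_caller()`, whereas `model_check_sb()` refuses only superblocks owned by a non-initial namespace.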
Seth Forshee
2015-07-15 19:46:02 UTC
Permalink
Initially this will be used to eliminate the implicit MNT_NODEV
flag for mounts from user namespaces. In the future it will also
be used for translating ids and checking capabilities for
filesystems mounted from user namespaces.

s_user_ns is initialized in alloc_super() and is generally set to
current_user_ns(). To avoid security and corruption issues, two
additional mount checks are also added:

- do_new_mount() gains a check that the user has CAP_SYS_ADMIN
in current_user_ns().

- sget() will fail with EBUSY when the filesystem it's looking
for is already mounted from another user namespace.

proc needs some special handling here. The user namespace of
current isn't appropriate when forking as a result of clone(2)
with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
from within the new user namespace. Instead, the user namespace
which owns the new pid namespace should be used. sget_userns() is
added to allow passing of a user namespace other than that of
current, and this is used by proc_mount(). sget() becomes a
wrapper around sget_userns() which passes current_user_ns().

Signed-off-by: Seth Forshee <***@canonical.com>
---
fs/namespace.c | 3 +++
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 8 ++++++++
4 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ce428cadd41f..f1f67d663d49 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2357,6 +2357,9 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
struct vfsmount *mnt;
int err;

+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
if (!fstype)
return -EINVAL;

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 361ab4ee42fc..4b302cbf13f9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -117,7 +117,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
return ERR_PTR(-EPERM);
}

- sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
+ sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ ns->user_ns, ns);
if (IS_ERR(sb))
return ERR_CAST(sb);

diff --git a/fs/super.c b/fs/super.c
index b61372354f2b..b5f171aadbf7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -33,6 +33,7 @@
#include <linux/cleancache.h>
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
+#include <linux/user_namespace.h>
#include "internal.h"


@@ -148,6 +149,7 @@ static void destroy_super(struct super_block *s)
list_lru_destroy(&s->s_inode_lru);
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_counter_destroy(&s->s_writers.counter[i]);
+ put_user_ns(s->s_user_ns);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -163,7 +165,8 @@ static void destroy_super(struct super_block *s)
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags)
+static struct super_block *alloc_super(struct file_system_type *type, int flags,
+ struct user_namespace *user_ns)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -231,6 +234,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+
+ s->s_user_ns = get_user_ns(user_ns);
return s;

fail:
@@ -427,17 +432,17 @@ void generic_shutdown_super(struct super_block *sb)
EXPORT_SYMBOL(generic_shutdown_super);

/**
- * sget - find or create a superblock
+ * sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
* @test: comparison callback
* @set: setup callback
* @flags: mount flags
* @data: argument to each of them
*/
-struct super_block *sget(struct file_system_type *type,
+struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
- int flags,
+ int flags, struct user_namespace *user_ns,
void *data)
{
struct super_block *s = NULL;
@@ -450,6 +455,10 @@ retry:
hlist_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
+ if (user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ return ERR_PTR(-EBUSY);
+ }
if (!grab_super(old))
goto retry;
if (s) {
@@ -462,7 +471,7 @@ retry:
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, flags);
+ s = alloc_super(type, flags, user_ns);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -485,6 +494,25 @@ retry:
return s;
}

+EXPORT_SYMBOL(sget_userns);
+
+/**
+ * sget - find or create a superblock
+ * @type: filesystem type superblock should belong to
+ * @test: comparison callback
+ * @set: setup callback
+ * @flags: mount flags
+ * @data: argument to each of them
+ */
+struct super_block *sget(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags,
+ void *data)
+{
+ return sget_userns(type, test, set, flags, current_user_ns(), data);
+}
+
EXPORT_SYMBOL(sget);

void drop_super(struct super_block *sb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42912f8d286e..1876477ac9f8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/user_namespace.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1353,6 +1354,8 @@ struct super_block {
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;

+ struct user_namespace *s_user_ns;
+
/*
* Keep the lru lists last in the structure so they always sit on their
* own individual cachelines.
@@ -1959,6 +1962,11 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_userns(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags, struct user_namespace *user_ns,
+ void *data);
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
--
1.9.1
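As a reading aid for the sget_userns() change above, here is a minimal user-space sketch of the lookup rule, assuming a flat array in place of the fs_supers list and an integer `dev_id` in place of the test() callback. All names (`model_user_ns`, `model_super`, `model_find_super`) are hypothetical; locking, grab_super() and the retry loop are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel definitions. */
struct model_user_ns { const struct model_user_ns *parent; };

struct model_super {
	int dev_id;                            /* what test() would compare */
	const struct model_user_ns *s_user_ns; /* namespace that created it */
};

/*
 * Returns 0 with *found set to the matching superblock, 0 with *found
 * left NULL when nothing matches (the caller would allocate a new one),
 * or -EBUSY when a match exists but belongs to another user namespace.
 */
static int model_find_super(struct model_super *supers, size_t n, int dev_id,
			    const struct model_user_ns *user_ns,
			    struct model_super **found)
{
	size_t i;

	assert(found != NULL);
	*found = NULL;
	for (i = 0; i < n; i++) {
		if (supers[i].dev_id != dev_id)
			continue;
		if (supers[i].s_user_ns != user_ns)
			return -EBUSY;
		*found = &supers[i];
		return 0;
	}
	return 0;
}
```

The point of the sketch is the ordering: the namespace comparison happens on the first matching superblock, before it is reused, so a second namespace never gets a reference to a superblock it did not create.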
Eric W. Biederman
2015-07-16 02:47:11 UTC
Permalink
Post by Seth Forshee
Initially this will be used to eliminate the implicit MNT_NODEV
flag for mounts from user namespaces. In the future it will also
be used for translating ids and checking capabilities for
filesystems mounted from user namespaces.
s_user_ns is initialized in alloc_super() and is generally set to
current_user_ns(). To avoid security and corruption issues, two
additional mount checks are also added:
- do_new_mount() gains a check that the user has CAP_SYS_ADMIN
in current_user_ns().
- sget() will fail with EBUSY when the filesystem it's looking
for is already mounted from another user namespace.
proc needs some special handling here. The user namespace of
current isn't appropriate when forking as a result of clone(2)
with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
from within the new user namespace. Instead, the user namespace
which owns the new pid namespace should be used. sget_userns() is
added to allow passing of a user namespace other than that of
current, and this is used by proc_mount(). sget() becomes a
wrapper around sget_userns() which passes current_user_ns().
From bits of the previous conversation.

We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
xattrs can travel from one mount of sysfs to another via the sysfs
backing store.

For tmpfs and any other filesystems that we support mounting without
privilege and that support xattrs, we need to identify them and
see if userspace is taking advantage of the ability to set
xattrs and file caps (unlikely). If it is, we need to call
sget_userns(..., &init_user_ns) on those filesystems as well.

Possibly/Probably we should just do that for all of the interesting
filesystems to start with and then change back to an ordinary old sget
after we have done the testing and confirmed we will not be introducing
userspace regressions.

Eric
Seth Forshee
2015-07-15 19:46:03 UTC
Permalink
From: "Eric W. Biederman" <***@xmission.com>

- Consolidate the test of whether a device node may be opened into a
new function, may_open_dev.

- Move the check for allowing access to device nodes on filesystems
not mounted in the initial user namespace from mount time to open
time and include it in may_open_dev.

This set of changes removes the implicit adding of MNT_NODEV which
simplifies the logic in fs/namespace.c and removes a potentially
problematic user visible difference in how normal and unprivileged
mount namespaces work.

Signed-off-by: "Eric W. Biederman" <***@xmission.com>
---
fs/block_dev.c | 2 +-
fs/namei.c | 9 ++++++++-
fs/namespace.c | 18 ++++--------------
include/linux/fs.h | 1 +
4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 198243717da5..f8ce371c437c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1729,7 +1729,7 @@ struct block_device *lookup_bdev(const char *pathname)
if (!S_ISBLK(inode->i_mode))
goto fail;
error = -EACCES;
- if (path.mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(&path))
goto fail;
error = -ENOMEM;
bdev = bd_acquire(inode);
diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..87c54cb34dce 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2635,6 +2635,13 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
}
EXPORT_SYMBOL(vfs_create);

+bool may_open_dev(const struct path *path)
+{
+ return !(path->mnt->mnt_flags & MNT_NODEV) &&
+ ((path->mnt->mnt_sb->s_user_ns == &init_user_ns) ||
+ (path->mnt->mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT));
+}
+
static int may_open(struct path *path, int acc_mode, int flag)
{
struct dentry *dentry = path->dentry;
@@ -2657,7 +2664,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
break;
case S_IFBLK:
case S_IFCHR:
- if (path->mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(path))
return -EACCES;
/*FALLTHRU*/
case S_IFIFO:
diff --git a/fs/namespace.c b/fs/namespace.c
index f1f67d663d49..423001de32a2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2153,13 +2153,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
!(mnt_flags & MNT_NODEV)) {
- /* Was the nodev implicitly added in mount? */
- if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
- !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- mnt_flags |= MNT_NODEV;
- } else {
- return -EPERM;
- }
+ return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
!(mnt_flags & MNT_NOSUID)) {
@@ -2372,13 +2366,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
put_filesystem(type);
return -EPERM;
}
- /* Only in special cases allow devices from mounts
- * created outside the initial user namespace.
- */
- if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- flags |= MS_NODEV;
- mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
- }
if (type->fs_flags & FS_USERNS_VISIBLE) {
if (!fs_fully_visible(type, &mnt_flags))
return -EPERM;
@@ -3214,6 +3201,9 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
mnt_flags = mnt->mnt.mnt_flags;
if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
+ if (mnt->mnt.mnt_sb->s_user_ns != &init_user_ns &&
+ !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
+ mnt_flags &= ~(MNT_LOCK_NODEV);

/* Verify the mount flags are equal to or more permissive
* than the proposed new mount.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1876477ac9f8..a0db522196ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1512,6 +1512,7 @@ extern void dentry_unhash(struct dentry *dentry);
*/
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
umode_t mode);
+extern bool may_open_dev(const struct path *path);
/*
* VFS FS_IOC_FIEMAP helper definitions.
*/
--
1.9.1
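The decision that may_open_dev() makes in the patch above can be restated as a small truth-table model. This is a user-space sketch only: the flag values and the `model_mount` struct are hypothetical, and the boolean `sb_in_init_user_ns` stands in for the comparison `s_user_ns == &init_user_ns`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model; not the kernel's constants. */
#define MODEL_MNT_NODEV           0x1u
#define MODEL_FS_USERNS_DEV_MOUNT 0x2u

struct model_mount {
	unsigned int mnt_flags;   /* per-mount flags */
	unsigned int fs_flags;    /* filesystem type flags */
	bool sb_in_init_user_ns;  /* superblock owned by the initial userns? */
};

/*
 * Device nodes may be opened only if the mount is not nodev AND either
 * the superblock belongs to the initial user namespace or the filesystem
 * type explicitly allows device nodes on user namespace mounts.
 */
static bool model_may_open_dev(const struct model_mount *m)
{
	assert(m != NULL);
	if (m->mnt_flags & MODEL_MNT_NODEV)
		return false;
	return m->sb_in_init_user_ns ||
	       (m->fs_flags & MODEL_FS_USERNS_DEV_MOUNT) != 0;
}
```

Because the test runs at open time, a mount created from a non-initial user namespace no longer needs an implicit nodev flag; the same result falls out of the superblock's owner.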
Seth Forshee
2015-07-15 19:46:04 UTC
Permalink
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.

Add a new helper function, in_userns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.

Signed-off-by: Seth Forshee <***@canonical.com>
---
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 2 ++
3 files changed, 24 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8297e5b341d8..a43faa727124 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
extern bool userns_may_setgroups(const struct user_namespace *ns);
+extern bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns);
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
{
return true;
}
+
+static inline bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ return true;
+}
#endif

#endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4109f8320684..2b043876d5f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -944,6 +944,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
return allowed;
}

+/*
+ * Returns true if @ns is the same namespace as or a descendant of
+ * @target_ns.
+ */
+bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ for (; ns; ns = ns->parent) {
+ if (ns == target_ns)
+ return true;
+ }
+ return false;
+}
+
static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
return container_of(ns, struct user_namespace, ns);
diff --git a/security/commoncap.c b/security/commoncap.c
index d103f5a4043d..175ab497e810 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -439,6 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c

if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
return 0;
+ if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+ return 0;

rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
--
1.9.1
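To make the ancestry rule in this patch concrete, the following user-space sketch models the parent walk and the resulting file-capability decision. The `model_*` names are hypothetical, and the real get_file_caps() does considerably more than this; the sketch only covers the MNT_NOSUID test and the new namespace test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: only the parent link. */
struct model_user_ns { const struct model_user_ns *parent; };

/* True if ns is target_ns itself or one of its descendants. */
static bool model_in_userns(const struct model_user_ns *ns,
			    const struct model_user_ns *target_ns)
{
	assert(target_ns != NULL);
	for (; ns; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/*
 * File capabilities are considered only when the mount is not nosuid and
 * the task's namespace is the superblock's namespace or a descendant.
 */
static bool model_file_caps_honored(bool mnt_nosuid,
				    const struct model_user_ns *task_ns,
				    const struct model_user_ns *sb_ns)
{
	if (mnt_nosuid)
		return false;
	return model_in_userns(task_ns, sb_ns);
}
```

In this model a task in the parent of the mounting namespace, or in a sibling namespace created separately from the same parent, does not get the file's capabilities, while tasks in the mounting namespace and its descendants do.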
Serge E. Hallyn
2015-07-15 21:48:48 UTC
Permalink
Post by Seth Forshee
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, in_userns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
Acked-by: Serge Hallyn <***@canonical.com>

I think it's an ok behavior, though let's just go over the
alternatives.

It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)

Conversely, suppose I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.

Again I think what you have here is the simplest and most sensible
choice, so ack.
Post by Seth Forshee
---
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 2 ++
3 files changed, 24 insertions(+)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8297e5b341d8..a43faa727124 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
extern bool userns_may_setgroups(const struct user_namespace *ns);
+extern bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns);
#else
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
{
return true;
}
+
+static inline bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ return true;
+}
#endif
#endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4109f8320684..2b043876d5f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -944,6 +944,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
return allowed;
}
+/*
+ * Returns true if @ns is the same namespace as or a descendant of
+ * @target_ns.
+ */
+bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ for (; ns; ns = ns->parent) {
+ if (ns == target_ns)
+ return true;
+ }
+ return false;
+}
+
static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
return container_of(ns, struct user_namespace, ns);
diff --git a/security/commoncap.c b/security/commoncap.c
index d103f5a4043d..175ab497e810 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -439,6 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
return 0;
+ if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+ return 0;
rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
--
1.9.1
Andy Lutomirski
2015-07-15 21:50:46 UTC
Permalink
Post by Serge E. Hallyn
Post by Seth Forshee
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, in_userns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
I think it's an ok behavior, though let's just go over the
alternatives.
It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)
Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and be able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.
Again I think what you have here is the simplest and most sensible
choice, so ack.
I think I'm missing something. Why is this separate from mnt_may_suid?

I can see why it would make sense to check s_user_ns (or maybe
s_user_ns *and* the vfsmount namespace) in mnt_may_suid, but I don't
see why we need separate checks.

--Andy
Eric W. Biederman
2015-07-15 22:35:24 UTC
Permalink
Post by Andy Lutomirski
Post by Serge E. Hallyn
Post by Seth Forshee
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, in_userns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
I think it's an ok behavior, though let's just go over the
alternatives.
It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)
Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and be able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.
Again I think what you have here is the simplest and most sensible
choice, so ack.
I think I'm missing something. Why is this separate from mnt_may_suid?
I can see why it would make sense to check s_user_ns (or maybe
s_user_ns *and* the vfsmount namespace) in mnt_may_suid, but I don't
see why we need separate checks.
So I don't quite understand your concerns that led to the mnt_may_suid
patch. But in my limited understanding there are two distinct issues.

1) What do file capabilities mean on a filesystem mounted with user
namespace privileges, where the unprivileged user can control what
resides on disk?

That is what this patch should be about.

Meaning and restricting those permissions to unprivileged users.

2) The second issue that I think your mnt_may_suid patch is about seems
to be what to do if a mount winds up in a place we never intended.

Aka leaks. I don't think any changes to mnt_may_suid are necessary
in that sense. However they may be useful.

So I think your mnt_may_suid change may be worth having but so far it
seems unnecessary.

Which is a long way of saying this patch is fundamentally necessary,
and I am not certain about the mnt_may_suid patch.

Am I right in understanding its purpose? Or does this patch actually
succeed in obsoleting it?

Eric
Seth Forshee
2015-07-16 01:14:10 UTC
Permalink
Post by Eric W. Biederman
Post by Andy Lutomirski
Post by Serge E. Hallyn
Post by Seth Forshee
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
I think it's an ok behavior, though let's just go over the
alternatives.
It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)
Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and be able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.
Again I think what you have here is the simplest and most sensible
choice, so ack.
I think I'm missing something. Why is this separate from mount_may_suid?
I can see why it would make sense to check s_user_ns (or maybe
s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
see why we need separate checks.
So I don't quite understand your concerns that led to the mnt_may_suid
patch. But in my limited understanding there are two distinct issues.
1) What do file capabilities mean on a filesystem mounted with user
namespace privileges, where the unprivileged user can control what
resides on disk?
That is what this patch should be about.
Meaning and restricting those permissions to unprivileged users.
2) The second issue that I think your mnt_may_suid patch is about seems
to be what to do if a mount winds up in a place we never intended.
Aka leaks. I don't think any changes to mnt_may_suid are necessary
in that sense. However they may be useful.
So I think your mnt_may_suid change may be worth having but so far it
seems unnecessary.
Which is a long way of saying this patch is fundamentally necessary,
and I am not certain about the mnt_may_suid patch.
Am I right in understanding its purpose? Or does this patch actually
succeed in obsoleting it?
The only part that's absolutely needed is the restriction on file caps,
otherwise it will be trivial to get root through a user namespace mount.
I've become convinced that the safest and most logical thing to do is to
restrict file capabilities to the user namespaces where the mounter
already has privileges, which is what the patch does.
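
As a rough illustration of the rule described above, here is a minimal
user-space sketch, not the actual kernel patch. The structs are
simplified stand-ins for the kernel's struct user_namespace and struct
super_block, and file_caps_apply() is a hypothetical name introduced
only for this example. It assumes that file caps are honored only when
the task's user namespace is s_user_ns or one of its descendants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct user_namespace. */
struct user_namespace {
	struct user_namespace *parent;	/* NULL for the initial namespace */
};

/* Simplified stand-in for struct super_block. */
struct super_block {
	struct user_namespace *s_user_ns;	/* namespace of the mounter */
};

/* True if ns is target_ns itself or a descendant of target_ns. */
bool in_user_ns(const struct user_namespace *ns,
		const struct user_namespace *target_ns)
{
	assert(target_ns != NULL);
	for (; ns != NULL; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/* Hypothetical helper: may file caps from sb apply to a task in task_ns? */
bool file_caps_apply(const struct user_namespace *task_ns,
		     const struct super_block *sb)
{
	return in_user_ns(task_ns, sb->s_user_ns);
}
```

With a superblock mounted from a child namespace, this model honors
file caps for the child and its descendants, but not for the parent or
for an unrelated sibling namespace.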

mnt_may_suid would also restrict the namespaces where the capabilities
would be honored, but not to only namespaces where the mounter is
already privileged. Of course it does require a user privileged in
another namespace to perform a mount, but that still leaves me feeling a
bit uncomfortable.

suid doesn't require quite so strict a check because (jumping ahead to
the patches I haven't sent yet) ids in a user namespace mount of a
normal filesystem are constrained to ids in that namespace. So users
could only exploit this to suid to ids they already control, or if they
managed to somehow bypass other kernel protections they could possibly
gain access to user ns mounts belonging to another user.

So if we have the s_user_ns check in get_file_caps the mnt_may_suid pass
isn't strictly necessary, but I still think it is useful as a mitigation
to the "leaks" Eric mentions. It _should_ be impossible for a user to
gain access to another user's mount namespace, and it _should_ be
impossible for a user to clear MNT_NOSUID in a bind mount from
init_user_ns. But if someone does find a way to do either then the patch
stops them from being able to gain privileges via suid, and I think
that's worth adding the check.

Andy alludes to the possibility of checking s_user_ns or both s_user_ns
and the mount namespace in mnt_may_suid, and those are certainly
possibilities that would work equally well (though checking both is
probably unnecessary). One thing I came away with from conversing with
Eric though is that he wants to see a clear and explicit check in
get_file_caps, not something implicit from mnt_may_suid. And I can see
his point - there is a concern with file capabilities independent of the
question of whether suid is allowed, and having a separate check does
make that clearer.

Seth
Andy Lutomirski
2015-07-16 01:23:01 UTC
Permalink
On Wed, Jul 15, 2015 at 6:14 PM, Seth Forshee
Post by Seth Forshee
mnt_may_suid would also restrict the namespaces where the capabilities
would be honored, but not to only namespaces where the mounter is
already privileged. Of course it does require a user privileged in
another namespace to perform a mount, but that still leaves me feeling a
bit uncomfortable.
Right. I think mnt_may_suid should check s_user_ns in addition.
Post by Seth Forshee
suid doesn't require quite so strict a check because (jumping ahead to
the patches I haven't sent yet) ids in a user namespace mount of a
normal filesystem are constrained to ids in that namespace. So users
could only exploit this to suid to ids they already control, or if they
managed to somehow bypass other kernel protections they could possibly
gain access to user ns mounts belonging to another user.
True. But LSM labels probably want the same protection as file caps,
and the mnt_may_suid approach handles that, too. (Your patches also do
this, but maybe we'd want to relax that some day for LSMs that are
scoped sensibly.)
Post by Seth Forshee
So if we have the s_user_ns check in get_file_caps the mnt_may_suid pass
isn't strictly necessary, but I still think it is useful as a mitigation
to the "leaks" Eric mentions. It _should_ be impossible for a user to
gain access to another user's mount namespace,
No, it's very easy with SCM_RIGHTS. We should make sure it's safe.
Post by Seth Forshee
Andy alludes to the possibility of checking s_user_ns or both s_user_ns
and the mount namespace in mnt_may_suid, and those are certainly
possibilities that would work equally well (though checking both is
probably unnecessary). One thing I came away with from conversing with
Eric though is that he wants to see a clear and explicit check in
get_file_caps, not something implicit from mnt_may_suid. And I can see
his point - there is a concern with file capabilities independent of the
question of whether suid is allowed, and having a separate check does
make that clearer.
But we absolutely need MS_NOSUID to block file caps, and it does. Why
not just use the existing mechanism with an expanded sense of
"nosuid"?

--Andy
Seth Forshee
2015-07-16 13:06:07 UTC
Permalink
Post by Andy Lutomirski
Post by Seth Forshee
So if we have the s_user_ns check in get_file_caps the mnt_may_suid pass
isn't strictly necessary, but I still think it is useful as a mitigation
to the "leaks" Eric mentions. It _should_ be impossible for a user to
gain access to another user's mount namespace,
No, it's very easy with SCM_RIGHTS. We should make sure it's safe.
Sure, what I really meant was that an attacker shouldn't be able to do
so without cooperation from the other user's processes. But I think
we're all in agreement that making it safe is a good idea.

Seth
Andy Lutomirski
2015-07-16 01:19:46 UTC
Permalink
On Wed, Jul 15, 2015 at 3:35 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
Post by Serge E. Hallyn
Post by Seth Forshee
Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.
Add a new helper function, in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
I think it's an ok behavior, though let's just go over the
alternatives.
It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)
Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and be able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.
Again I think what you have here is the simplest and most sensible
choice, so ack.
I think I'm missing something. Why is this separate from mount_may_suid?
I can see why it would make sense to check s_user_ns (or maybe
s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
see why we need separate checks.
So I don't quite understand your concerns that led to the mnt_may_suid
patch. But in my limited understanding there are two distinct issues.
The issue is that we need some kind of control for whether a given
operation should trust a given mounted filesystem. There are two
kinds of trust: trusting the fs for execve security context (nosuid
controls this) and trusting it for LSM access restrictions. I think
that, in an unprivileged namespace context, the latter is a bit silly
-- the creator of the fs owns it, full stop. I'm talking about the
former.

In particular, if I unshare everything, mount a fresh FUSE, shove a
setuid, fcapped, LSM-labeled thing in it, pass a file descriptor out,
and have someone in the root ns execve it, then *pwned*.

My suggestion is to use a single function to control this, and I
called it mnt_may_suid. We can certainly debate when that function
should return true, but I'm unconvinced that the conditions for LSM
and for regular setuid should be different.
Post by Eric W. Biederman
1) What do file capabilities mean on a filesystem mounted with user
namespace privileges, where the unprivileged user can control what
resides on disk?
That is what this patch should be about.
Meaning and restricting those permissions to unprivileged users.
I think that file caps should mean what they usually do if the execve
caller's userns should trust the file. Otherwise file caps should do
nothing.

My original idea was that a namespace trusts a vfsmount if the
namespace or one of its ancestors created the mount. Doing the same
thing but with s_user_ns might also make sense.
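
To make that trust rule concrete, here is a small user-space sketch. It
is only a model under stated assumptions: the owner_ns field on the toy
struct vfsmount and the two-argument mnt_may_suid() signature are
hypothetical, chosen for illustration, and MNT_NOSUID is a local
stand-in constant rather than the kernel header definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MNT_NOSUID 0x01	/* local stand-in for the kernel mount flag */

/* Simplified stand-in for the kernel's struct user_namespace. */
struct user_namespace {
	struct user_namespace *parent;	/* NULL for the initial namespace */
};

/* Toy vfsmount; owner_ns is a hypothetical field for this sketch. */
struct vfsmount {
	unsigned int mnt_flags;
	struct user_namespace *owner_ns;	/* namespace that created the mount */
};

/* True if ns is ancestor itself or one of its descendants. */
bool ns_is_same_or_descendant(const struct user_namespace *ns,
			      const struct user_namespace *ancestor)
{
	for (; ns != NULL; ns = ns->parent) {
		if (ns == ancestor)
			return true;
	}
	return false;
}

/*
 * The caller trusts the mount for execve purposes only if the mount is
 * not nosuid and was created by the caller's namespace or an ancestor.
 */
bool mnt_may_suid(const struct vfsmount *mnt,
		  const struct user_namespace *caller_ns)
{
	assert(mnt != NULL && caller_ns != NULL);
	if (mnt->mnt_flags & MNT_NOSUID)
		return false;
	return ns_is_same_or_descendant(caller_ns, mnt->owner_ns);
}
```

In this model a container trusts mounts created by the host, but the
host does not trust mounts created inside the container. Swapping
owner_ns for s_user_ns gives the variant mentioned above.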
Post by Eric W. Biederman
2) The second issue that I think your mnt_may_suid patch is about seems
to be what to do if a mount winds up in a place we never intended.
Aka leaks. I don't think any changes to mnt_may_suid are necessary
in that sense. However they may be useful.
So I think your mnt_may_suid change may be worth having but so far it
seems unnecessary.
There's that, too. For one thing, with my mnt_may_suid patch (or a
variant that checks the vfsmount and s_user_ns), we could drop the
bind-mount nosuid restrictions. If you want to bind-mount an
MS_NOSUID mount without MS_NOSUID, then that's fine -- you can't do
any harm.
Post by Eric W. Biederman
Which is a long way of saying this patch is fundamentally necessary,
and I am not certain about the mnt_may_suid patch.
Am I right in understanding its purpose? Or does this patch actually
succeed in obsoleting it?
Other way around. I think that an improved mnt_may_suid patch might
render this patch unnecessary.

--Andy
Eric W. Biederman
2015-07-16 04:23:55 UTC
Permalink
Ok. Andy I have stopped and really looked at your patch that is 4/7 in
this series. Something I had not done before since it sounded totally
wrong.

That combined with your earlier comments I think I can say something
meaningful.

Andy, as I read your patch, the threat you are primarily worried about is
chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
deal with that case is reasonable, and is unlikely to break userspace.
It is one of those hairy security things so we need to be careful not to
introduce a regression.

I think a top down enhancement of nosuid to just block funny cases that
no one cares about is completely sensible. Removing goofy corner cases
that no one cares about and that are only good for security exploits
seems reasonable.

I am a little concerned that smack does not seem to respect nosuid
on filesystems. But that is an issue with nosuid not with your enhanced
nosuid.




Now this patch 3/7 really should be entitled:
"Limit file caps to the userns of the super block".

It really really is doing something different. This change is about a
bottom up understanding of what file caps means on a filesystem mounted
by a user namespace root.

That is file caps should only apply to the user namespace root of the
root user who mounted the filesystem, because that is all the privileges
the mounter of the filesystem had.

This guarantees that even if the filesystem somehow propagates with
mount propagation that there will be no issues. I think I know how to
make that happen...




But deeply and fundamentally limiting a filesystem to only the
privileges of its user namespace root, and enhancing nosuid
protections are rather different things.


The approaches show up differently for dealing with uids and gids,
as mappings are required. The approaches will likely continue to
show up differently for file caps when Serge implements a version
of file caps with a user namespace root in them.

The approaches fundamentally will need to do different things with
security xattrs, as mnt_may_suid can just treat the filesystem as one
without labels, while ultimately the LSMs will have to do something
meaningful.



So while in the very narrow case of today's file caps the two approaches
are the same, enhancing nosuid is something very different from
limiting a filesystem to its mounter's user namespace.

Eric
Andy Lutomirski
2015-07-16 04:49:16 UTC
Permalink
On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
Post by Eric W. Biederman
Ok. Andy I have stopped and really looked at your patch that is 4/7 in
this series. Something I had not done before since it sounded totally
wrong.
That combined with your earlier comments I think I can say something
meaningful.
Andy, as I read your patch, the threat you are primarily worried about is
chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
deal with that case is reasonable, and is unlikely to break userspace.
It is one of those hairy security things so we need to be careful not to
introduce a regression.
Indeed. It's plausible this could regress something, but it would be
really weird.
Post by Eric W. Biederman
I think a top down enhancement of nosuid to just block funny cases that
no one cares about is completely sensible. Removing goofy corner cases
that no one cares about and that are only good for security exploits
seems reasonable.
Agreed.
Post by Eric W. Biederman
I am a little concerned that smack does not seem to respect nosuid
on filesystems. But that is an issue with nosuid not with your enhanced
nosuid.
"Limit file caps to the userns of the super block".
It really really is doing something different. This change is about a
bottom up understanding of what file caps means on a filesystem mounted
by a user namespace root.
That is file caps should only apply to the user namespace root of the
root user who mounted the filesystem, because that is all the privileges
the mounter of the filesystem had.
This guarantees that even if the filesystem somehow propagates with
mount propagation that there will be no issues. I think I know how to
make that happen...
But deeply and fundamentally limiting a filesystem to only the
privileges of its user namespace root, and enhancing nosuid
protections are rather different things.
So here's the semantic question:

Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.

Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.

I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.

And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.

--Andy
Eric W. Biederman
2015-07-16 05:04:43 UTC
Permalink
Post by Andy Lutomirski
On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
Post by Eric W. Biederman
Ok. Andy I have stopped and really looked at your patch that is 4/7 in
this series. Something I had not done before since it sounded totally
wrong.
That combined with your earlier comments I think I can say something
meaningful.
Andy, as I read your patch, the threat you are primarily worried about is
chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
deal with that case is reasonable, and is unlikely to break userspace.
It is one of those hairy security things so we need to be careful not to
introduce a regression.
Indeed. It's plausible this could regress something, but it would be
really weird.
Post by Eric W. Biederman
I think a top down enhancement of nosuid to just block funny cases that
no one cares about is completely sensible. Removing goofy corner cases
that no one cares about and that are only good for security exploits
seems reasonable.
Agreed.
Post by Eric W. Biederman
I am a little concerned that smack does not seem to respect nosuid
on filesystems. But that is an issue with nosuid not with your enhanced
nosuid.
"Limit file caps to the userns of the super block".
It really really is doing something different. This change is about a
bottom up understanding of what file caps means on a filesystem mounted
by a user namespace root.
That is file caps should only apply to the user namespace root of the
root user who mounted the filesystem, because that is all the privileges
the mounter of the filesystem had.
This guarantees that even if the filesystem somehow propagates with
mount propagation that there will be no issues. I think I know how to
make that happen...
But deeply and fundamentally limiting a filesystem to only the
privileges of its user namespace root, and enhancing nosuid
protections are rather different things.
Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.
To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns is stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?

I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to set up fcaps.
Post by Andy Lutomirski
Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.
With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.

From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000's user namespace, but that is very
unlikely to be worth it.
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.

For setuid we can honor it, and setuid is something that the user
namespace allows.
Post by Andy Lutomirski
And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.
There are really two separate questions:
- Do we trust this filesystem?
- Do you have the bits to implement this concept?

Even if in this specific context the two questions wind up looking
exactly the same, I think it makes a lot of sense to ask the two
questions separately, as future maintenance changes may cause the
implementation of the questions to diverge.

Eric
Andy Lutomirski
2015-07-16 05:15:30 UTC
Permalink
On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
Post by Eric W. Biederman
Ok. Andy I have stopped and really looked at your patch that is 4/7 in
this series. Something I had not done before since it sounded totally
wrong.
That combined with your earlier comments I think I can say something
meaningful.
Andy, as I read your patch, the threat you are primarily worried about is
chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
deal with that case is reasonable, and is unlikely to break userspace.
It is one of those hairy security things so we need to be careful not to
introduce a regression.
Indeed. It's plausible this could regress something, but it would be
really weird.
Post by Eric W. Biederman
I think a top down enhancement of nosuid to just block funny cases that
no one cares about is completely sensible. Removing goofy corner cases
that no one cares about and that are only good for security exploits
seems reasonable.
Agreed.
Post by Eric W. Biederman
I am a little concerned that smack does not seem to respect nosuid
on filesystems. But that is an issue with nosuid not with your enhanced
nosuid.
"Limit file caps to the userns of the super block".
It really really is doing something different. This change is about a
bottom up understanding of what file caps means on a filesystem mounted
by a user namespace root.
That is file caps should only apply to the user namespace root of the
root user who mounted the filesystem, because that is all the privileges
the mounter of the filesystem had.
This guarantees that even if the filesystem somehow propagates with
mount propagation that there will be no issues. I think I know how to
make that happen...
But deeply and fundamentally limiting a filesystem to only the
privileges of its user namespace root, and enhancing nosuid
protections are rather different things.
Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.
To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns is stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?
I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to set up fcaps.
I was thinking uid 0 as seen by the filesystem. But even if it were
uid 1000, the unprivileged user can still set whatever mode and xattrs
they want -- they control the backing store.
Post by Eric W. Biederman
Post by Andy Lutomirski
Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.
With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.
From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000's user namespace, but that is very
unlikely to be worth it.
I agree.
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
We certainly *can* honor it. But why should we? I'd be more
comfortable with this if the contents of an untrusted filesystem were
really treated as just data.
Post by Eric W. Biederman
Post by Andy Lutomirski
And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.
- Do we trust this filesystem?
- Do you have the bits to implement this concept?
Even if in this specific context the two questions wind up looking
exactly the same, I think it makes a lot of sense to ask the two
questions separately, as future maintenance changes may cause the
implementation of the questions to diverge.
Agreed.

Unless someone thinks of an argument to the contrary, I'd say "no, we
don't trust this filesystem". I could be convinced otherwise.

--Andy
Eric W. Biederman
2015-07-16 05:44:49 UTC
Permalink
Post by Andy Lutomirski
On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.
To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns is stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?
I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to set up fcaps.
I was thinking uid 0 as seen by the filesystem. But even if it were
uid 1000, the unprivileged user can still set whatever mode and xattrs
they want -- they control the backing store.
Yes. And that is what I was really asking. Are we talking about a
filesystem where the user controls the backing store?
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.
With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.
From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000's user namespace, but that is very
unlikely to be worth it.
I agree.
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
We certainly *can* honor it. But why should we? I'd be more
comfortable with this if the contents of an untrusted filesystem were
really treated as just data.
In these weird bleed-through situations I don't know that we should.
But extending nosuid protections in this way is a bit like Yama: a bit
of gratuitous stomping on don't-care cases in the semantics to make
bugs harder to exploit.
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.
- Do we trust this filesystem?
- Do you have the bits to implement this concept?
Even if in this specific context the two questions wind up looking
exactly the same, I think it makes a lot of sense to ask the two
questions separately, as future maintenance changes may cause the
implementation of the questions to diverge.
Agreed.
Unless someone thinks of an argument to the contrary, I'd say "no, we
don't trust this filesystem". I could be convinced otherwise.
But this is context dependent. From the perspective of the container
we really do want to trust the filesystem, as the container root set it
up and, if he isn't being hostile, likely has a use for files with
fcaps, setuid files, and all of the rest.

Perhaps I should phrase it as:
- In this context do we trust the code? AKA mnt_may_suid?
- What do these bits mean in this context? (Usually something more complicated).

Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
because 3 is different than 4.
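
The two questions can be modeled as two independent predicates. The
following user-space sketch is illustrative only: struct exec_ctx,
trust_mount(), fcaps_meaningful(), and decide_exec() are hypothetical
names, and the trust check assumes the patch 4 behavior as described in
this thread (the mount must not be nosuid and must belong to the
caller's mount namespace).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct user_namespace. */
struct user_namespace {
	struct user_namespace *parent;	/* NULL for the initial namespace */
};

/* Hypothetical summary of the facts an execve decision depends on. */
struct exec_ctx {
	const struct user_namespace *caller_user_ns;
	const struct user_namespace *s_user_ns;	/* superblock owner */
	bool mnt_nosuid;		/* mount carries nosuid */
	bool mnt_in_caller_mnt_ns;	/* mount is in caller's mount ns */
};

struct exec_decision {
	bool honor_setuid;
	bool honor_fcaps;
};

/* True if ns is target_ns itself or a descendant of target_ns. */
bool in_user_ns(const struct user_namespace *ns,
		const struct user_namespace *target_ns)
{
	for (; ns != NULL; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/* Question 1: in this context, do we trust the code on this mount? */
bool trust_mount(const struct exec_ctx *ctx)
{
	return !ctx->mnt_nosuid && ctx->mnt_in_caller_mnt_ns;
}

/* Question 2: do the file caps mean anything to this caller? */
bool fcaps_meaningful(const struct exec_ctx *ctx)
{
	return in_user_ns(ctx->caller_user_ns, ctx->s_user_ns);
}

struct exec_decision decide_exec(const struct exec_ctx *ctx)
{
	struct exec_decision d;

	assert(ctx != NULL);
	d.honor_setuid = trust_mount(ctx);
	d.honor_fcaps = d.honor_setuid && fcaps_meaningful(ctx);
	return d;
}
```

In this model, global root executing the file through its own bind
mount gets setuid honored but not fcaps, while executing through an fd
into another mount namespace honors neither. That matches the scenario
discussed earlier in the thread and shows where the two checks diverge.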

And now I better context switch back to fixing bind mounts.

Eric
Seth Forshee
2015-07-16 13:13:08 UTC
Permalink
Post by Eric W. Biederman
Post by Andy Lutomirski
On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.
To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns is stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?
I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to setup fcaps.
I was thinking uid 0 as seen by the filesystem. But even if it were
uid 1000, the unprivileged user can still set whatever mode and xattrs
they want -- they control the backing store.
Yes. And that is what I was really asking. Are we talking about a
filesystem where the user controls the backing store?
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.
With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.
From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000s user namespace, but that is very
unlikely to be worth it.
I agree.
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
We certainly *can* honor it. But why should we? I'd be more
comfortable with this if the contents of an untrusted filesystem were
really treated as just data.
In these weird bleed-through situations I don't know that we should.
But extending nosuid protections in this way is a bit like Yama:
somewhat gratuitous stomping on don't-care cases in the semantics to
make bugs harder to exploit.
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.
- Do we trust this filesystem?
- Do you have the bits to implement this concept?
Even if in this specific context the two questions wind up looking
exactly the same. I think it makes a lot of sense to ask the two
questions separately. As future maintenance changes may cause the
implementation of the questions to diverge.
Agreed.
Unless someone thinks of an argument to the contrary, I'd say "no, we
don't trust this filesystem". I could be convinced otherwise.
But this is context dependent. From the perspective of the container
we really do want to trust the filesystem. As the container root set it
up, and if he isn't being hostile likely has a use for setfcaps files
and setuid files and all of the rest.
- In this context do we trust the code? AKA mnt_may_suid?
- What do these bits mean in this context? (Usually something more complicated).
Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
because 3 is different than 4.
So what I'll do is:

- Add a s_user_ns check to mnt_may_suid
- Keep the (now redundant) s_user_ns check in get_file_caps

I'm on the fence about having both the mnt and user ns checks in
mnt_may_suid - it might be overkill, but it still adds the protection
against clearing MNT_NOSUID in a bind mount. So I guess I'll keep the
mnt ns check.
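
The combined check described above can be sketched in userspace as follows. This is only a model under stated assumptions: `mount_model`, `task_model`, `in_userns_model` and `mnt_may_suid_model` are hypothetical stand-ins for the kernel objects, and the user namespace test is assumed to mean "the caller's user namespace is s_user_ns or one of its descendants".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structs. */
struct user_ns { const struct user_ns *parent; };  /* parent is NULL for the initial ns */
struct mnt_ns  { int id; };

struct mount_model {
    const struct mnt_ns  *mnt_ns;     /* mount namespace the mount belongs to */
    const struct user_ns *s_user_ns;  /* user namespace that owns the superblock */
    bool nosuid;                      /* models the MNT_NOSUID flag */
};

struct task_model {
    const struct mnt_ns  *mnt_ns;     /* caller's mount namespace */
    const struct user_ns *user_ns;    /* caller's user namespace */
};

/* True if ns is target itself or a descendant of target (walks parent links). */
static bool in_userns_model(const struct user_ns *ns, const struct user_ns *target)
{
    for (; ns; ns = ns->parent)
        if (ns == target)
            return true;
    return false;
}

/*
 * Model of the planned check: honor suid/sgid/fcaps only if the mount is not
 * nosuid, lives in the caller's mount namespace, and its superblock is owned
 * by the caller's user namespace or an ancestor of it.
 */
static bool mnt_may_suid_model(const struct task_model *task,
                               const struct mount_model *mnt)
{
    return !mnt->nosuid &&
           mnt->mnt_ns == task->mnt_ns &&
           in_userns_model(task->user_ns, mnt->s_user_ns);
}
```

In this model a bind mount of a container-owned superblock into the host's mount namespace passes the mount namespace test but still fails the s_user_ns test, which is the case the extra check is meant to cover.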

Seth
Eric W. Biederman
2015-07-17 00:43:24 UTC
Post by Seth Forshee
Post by Eric W. Biederman
Post by Andy Lutomirski
On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.
To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?
I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to setup fcaps.
I was thinking uid 0 as seen by the filesystem. But even if it were
uid 1000, the unprivileged user can still set whatever mode and xattrs
they want -- they control the backing store.
Yes. And that is what I was really asking. Are we talking about a
filesystem where the user controls the backing store?
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.
With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.
From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000s user namespace, but that is very
unlikely to be worth it.
I agree.
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
We certainly *can* honor it. But why should we? I'd be more
comfortable with this if the contents of an untrusted filesystem were
really treated as just data.
In these weird bleed-through situations I don't know that we should.
But extending nosuid protections in this way is a bit like Yama:
somewhat gratuitous stomping on don't-care cases in the semantics to
make bugs harder to exploit.
Post by Andy Lutomirski
Post by Eric W. Biederman
Post by Andy Lutomirski
And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.
- Do we trust this filesystem?
- Do you have the bits to implement this concept?
Even if in this specific context the two questions wind up looking
exactly the same. I think it makes a lot of sense to ask the two
questions separately. As future maintenance changes may cause the
implementation of the questions to diverge.
Agreed.
Unless someone thinks of an argument to the contrary, I'd say "no, we
don't trust this filesystem". I could be convinced otherwise.
But this is context dependent. From the perspective of the container
we really do want to trust the filesystem. As the container root set it
up, and if he isn't being hostile likely has a use for setfcaps files
and setuid files and all of the rest.
- In this context do we trust the code? AKA mnt_may_suid?
- What do these bits mean in this context? (Usually something more complicated).
Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
because 3 is different than 4.
- Add a s_user_ns check to mnt_may_suid
- Keep the (now redundant) s_user_ns check in get_file_caps
I'm on the fence about having both the mnt and user ns checks in
mnt_may_suid - it might be overkill, but it still adds the protection
against clearing MNT_NOSUID in a bind mount. So I guess I'll keep the
mnt ns check.
That sounds like a plan.

Eric
Serge E. Hallyn
2015-07-29 16:04:50 UTC
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
Setuid is something explicitly tied to the user id. File capabilities
are MAC, that is, explicitly orthogonal to user id. So 100% agreed with
honoring setuid in user_ns and, for now, ignoring file caps.

As I've mentioned a few times privately, I'm intending to implement
user-namespaced file capabilities as a new xattr. Design is not 100%
nailed down, but probably it would support a set of userns_fcaps, each
of which lists the k_uid of the root user in the namespace assigning the
filecaps, followed by three sets. Then when exec()ing the file, if
the current->userns->root user has a userns_fcap entry, or there is a -1
entry, then use that, else use nothing. I think this is a very important
thing to support, to remove a barrier to shipping packages with software
using filecaps. Without this, any package, say ping, which wants to
support being installed in an (unprivileged) container would need to also
support use without filecaps, meaning that will likely be the only
supported mode.
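
A rough sketch of the lookup described above, purely illustrative since the design is stated to be unfinished. The struct layout, the names `userns_fcap` and `userns_fcap_lookup`, the choice of permitted/inheritable/effective as the "three sets", and preferring an exact match over the -1 entry are all assumptions, not part of any existing xattr format.

```c
#include <stddef.h>
#include <stdint.h>

/* Wildcard root uid, modelling the "-1 entry" (assumed encoding). */
#define FCAP_ANY_ROOT ((uint32_t)-1)

/* One hypothetical entry of the proposed xattr. */
struct userns_fcap {
    uint32_t root_kuid;    /* kernel uid of the root user of the assigning namespace */
    uint64_t permitted;    /* the three capability sets (assumed to be these three) */
    uint64_t inheritable;
    uint64_t effective;
};

/*
 * Return the entry to apply when a task whose user namespace root has kernel
 * uid ns_root_kuid execs the file: an exact match if one exists, otherwise
 * the first wildcard entry, otherwise NULL (use nothing).
 */
static const struct userns_fcap *
userns_fcap_lookup(const struct userns_fcap *entries, size_t n,
                   uint32_t ns_root_kuid)
{
    const struct userns_fcap *wildcard = NULL;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].root_kuid == ns_root_kuid)
            return &entries[i];
        if (entries[i].root_kuid == FCAP_ANY_ROOT && !wildcard)
            wildcard = &entries[i];
    }
    return wildcard;
}
```

With entries for root kuid 100000 and for the wildcard, a container rooted at kuid 100000 gets its own entry, any other container falls back to the wildcard entry, and a file with no matching entry confers no capabilities.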

-serge
Serge E. Hallyn
2015-07-29 16:18:06 UTC
Post by Serge E. Hallyn
Post by Eric W. Biederman
Post by Andy Lutomirski
I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.
For the file caps we can't honor them because you don't have the bits
in struct cred.
For setuid we can honor it, and setuid is something that the user
namespace allows.
Setuid is something explicitly tied to the user id. File capabilities
are MAC, that is, explicitly orthogonal to user id. So 100% agreed with
honoring setuid in user_ns and, for now, ignoring file caps.
Hm. No. Seems like both should be fine when current is in the mounter's
user_ns, and ignored otherwise.

(The below is still needed :)
Post by Serge E. Hallyn
As I've mentioned a few times privately, I'm intending to implement
user-namespaced file capabilities as a new xattr. Design is not 100%
nailed down, but probably it would support a set of userns_fcaps, each
of which lists the k_uid of the root user in the namespace assigning the
filecaps, followed by three sets. Then when exec()ing the file, if
the current->userns->root user has a userns_fcap entry, or there is a -1
entry, then use that, else use nothing. I think this is a very important
thing to support, to remove a barrier to shipping packages with software
using filecaps. Without this, any package, say ping, which wants to
support being installed in an (unprivileged) container would need to also
support use without filecaps, meaning that will likely be the only
supported mode.
-serge
Seth Forshee
2015-07-15 19:46:05 UTC
From: Andy Lutomirski <***@amacapital.net>

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.
Technically, trusting mounts created by the same or ancestor user
namespaces ought to be safe, but it's simpler to distrust all
foreign mounts.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendants. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileged.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski <***@amacapital.net>
[ saf: Forward ported to 4.2 ]
Signed-off-by: Seth Forshee <***@canonical.com>
---
fs/exec.c | 2 +-
fs/namespace.c | 13 +++++++++++++
include/linux/mount.h | 1 +
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..ea7311d72cc3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return;

if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index 423001de32a2..2bfd7ca92247 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3252,6 +3252,19 @@ found:
return visible;
}

+bool mnt_may_suid(struct vfsmount *mnt)
+{
+ /*
+ * Foreign mounts (accessed via fchdir or through /proc
+ * symlinks) are always treated as if they are nosuid. This
+ * prevents namespaces from trusting potentially unsafe
+ * suid/sgid bits, file caps, or security labels that originate
+ * in other namespaces.
+ */
+ return real_mount(mnt)->mnt_ns == current->nsproxy->mnt_ns &&
+ !(mnt->mnt_flags & MNT_NOSUID);
+}
+
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);

struct path;
extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 175ab497e810..858d86a1b73c 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -437,7 +437,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c5c49d..459e71ddbc9d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2137,7 +2137,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
const struct task_security_struct *new_tsec)
{
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
- int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+ int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;

if (!nnp && !nosuid)
--
1.9.1
Nikolay Borisov
2015-07-17 06:46:12 UTC
Post by Seth Forshee
If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.
Technically, trusting mounts created by the same or ancestor user
namespaces ought to be safe, but it's simpler to distrust all
foreign mounts.
This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.
This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.
As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendants. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileged.
On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.
As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.
[ saf: Forward ported to 4.2 ]
---
fs/exec.c | 2 +-
fs/namespace.c | 13 +++++++++++++
include/linux/mount.h | 1 +
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
5 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..ea7311d72cc3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();
- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return;
if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index 423001de32a2..2bfd7ca92247 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
return visible;
}
+bool mnt_may_suid(struct vfsmount *mnt)
+{
+ /*
+ * Foreign mounts (accessed via fchdir or through /proc
+ * symlinks) are always treated as if they are nosuid. This
+ * prevents namespaces from trusting potentially unsafe
+ * suid/sgid bits, file caps, or security labels that originate
+ * in other namespaces.
+ */
+ return real_mount(mnt)->mnt_ns == current->nsproxy->mnt_ns &&
+ !(mnt->mnt_flags & MNT_NOSUID);
Maybe check_mnt() from fs/namespace.c can be exported and used here,
instead of open coding it.
Post by Seth Forshee
+}
+
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);
struct path;
extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 175ab497e810..858d86a1b73c 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -437,7 +437,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;
- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c5c49d..459e71ddbc9d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2137,7 +2137,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
const struct task_security_struct *new_tsec)
{
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
- int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+ int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;
if (!nnp && !nosuid)
Seth Forshee
2015-07-15 19:46:06 UTC
Respecting security labels for mounts from user namespaces may
allow unprivileged users to introduce security labels into the
system. To stop this from happening prevent calling the
inode_post_setxattr, inode_setsecurity, inode_notifysecctx, and
inode_setsecctx hooks when s_user_ns != init_user_ns. There's no
purpose in actually blocking setting of these xattrs, as (for rw
mounts at least) the user must have write access to the
underlying filesystem and could set the xattrs by other means.

Signed-off-by: Seth Forshee <***@canonical.com>
---
security/security.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..980710baa8f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -653,7 +653,9 @@ void security_inode_post_setxattr(struct dentry *dentry, const char *name,
{
if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
return;
- call_void_hook(inode_post_setxattr, dentry, name, value, size, flags);
+ if (dentry->d_inode->i_sb->s_user_ns == &init_user_ns)
+ call_void_hook(inode_post_setxattr, dentry, name, value, size,
+ flags);
evm_inode_post_setxattr(dentry, name, value, size);
}

@@ -712,6 +714,8 @@ int security_inode_getsecurity(const struct inode *inode, const char *name, void

int security_inode_setsecurity(struct inode *inode, const char *name, const void *value, size_t size, int flags)
{
+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
if (unlikely(IS_PRIVATE(inode)))
return -EOPNOTSUPP;
return call_int_hook(inode_setsecurity, -EOPNOTSUPP, inode, name,
@@ -1168,12 +1172,16 @@ EXPORT_SYMBOL(security_release_secctx);

int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen)
{
+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
return call_int_hook(inode_notifysecctx, 0, inode, ctx, ctxlen);
}
EXPORT_SYMBOL(security_inode_notifysecctx);

int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen)
{
+ if (dentry->d_inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
return call_int_hook(inode_setsecctx, 0, dentry, ctx, ctxlen);
}
EXPORT_SYMBOL(security_inode_setsecctx);
--
1.9.1
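
The gating pattern in the patch above can be modelled in userspace roughly as below. This is a sketch only: `sb_model`, `lsm_setsecurity_hook` and `security_inode_setsecurity_model` are hypothetical names, and the single counter stands in for whatever LSM hooks would otherwise run.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a superblock; not the kernel struct. */
struct sb_model {
    bool owned_by_init_user_ns;  /* models s_user_ns == &init_user_ns */
};

/* Counts how often the (mock) LSM hook actually ran. */
static int hook_calls;

/* Mock LSM hook: records the call and reports success. */
static int lsm_setsecurity_hook(void)
{
    hook_calls++;
    return 0;
}

/*
 * Model of the gated wrapper: for superblocks owned by a non-init user
 * namespace the hook is never reached and -EOPNOTSUPP is returned, so the
 * LSM does not act on a label it cannot trust.
 */
static int security_inode_setsecurity_model(const struct sb_model *sb)
{
    if (!sb->owned_by_init_user_ns)
        return -EOPNOTSUPP;
    return lsm_setsecurity_hook();
}
```

The model only shows that the LSM never sees the request for such superblocks; as the commit message notes, the xattr itself can still be written by other means.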
Seth Forshee
2015-07-15 19:46:07 UTC
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.

Signed-off-by: Seth Forshee <***@canonical.com>
---
security/selinux/hooks.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 459e71ddbc9d..eeb71e45ab82 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
!strcmp(sb->s_type->name, "pstore"))
sbsec->flags |= SE_SBGENFS;

+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid)
+ return -EPERM;
+ sbsec->behavior = SECURITY_FS_USE_NONE;
+ goto out_set_opts;
+ }
+
+
if (!sbsec->behavior) {
/*
* Determine the labeling behavior to use for this
@@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
--
1.9.1
Stephen Smalley
2015-07-16 13:23:33 UTC
Post by Seth Forshee
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.
Post by Seth Forshee
---
security/selinux/hooks.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 459e71ddbc9d..eeb71e45ab82 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
!strcmp(sb->s_type->name, "pstore"))
sbsec->flags |= SE_SBGENFS;
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid)
+ return -EPERM;
+ sbsec->behavior = SECURITY_FS_USE_NONE;
+ goto out_set_opts;
+ }
+
+
if (!sbsec->behavior) {
/*
* Determine the labeling behavior to use for this
@@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}
rc = sb_finish_set_opts(sb);
mutex_unlock(&sbsec->lock);
Stephen Smalley
2015-07-22 16:02:13 UTC
Post by Stephen Smalley
Post by Seth Forshee
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.
The right way to handle this for SELinux would be to automatically use
mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
specifying a context= mount option), with the sbsec->mntpoint_sid set
from some related object (e.g. the block device file context, as in your
patches for Smack). That will cause SELinux to use that value instead
of any xattr value from the filesystem and will cause attempts by
userspace to set the security.selinux xattr to fail on that filesystem.
That is how SELinux normally deals with untrusted filesystems, except
that it is normally specified as a mount option by a trusted mounting
process, whereas in your case you need to automatically set it.
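
As a minimal userspace sketch of that selection logic: the names `sb_label_model`, `apply_userns_labeling` and the enum values are hypothetical, and the choices to reject explicit context options with -EACCES and to fall back to the mounting task's SID when there is no block device follow the untested patch posted later in this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified labeling behaviors; the real SELinux set is larger. */
enum label_behavior { USE_XATTR, USE_MNTPOINT, USE_OTHER };

/* Hypothetical stand-in for the superblock plus its SELinux state. */
struct sb_label_model {
    bool userns_mount;             /* models s_user_ns != &init_user_ns */
    bool has_bdev;                 /* is the filesystem backed by a block device? */
    uint32_t bdev_sid;             /* SID of the block device file, if any */
    enum label_behavior behavior;  /* behavior chosen by policy so far */
    uint32_t mntpoint_sid;         /* SID applied to every file under USE_MNTPOINT */
};

/*
 * For user namespace mounts: refuse explicit context options, and replace
 * xattr-based labeling with mountpoint labeling so that labels stored in the
 * untrusted filesystem are never consulted. Other mounts are left untouched.
 */
static int apply_userns_labeling(struct sb_label_model *sb,
                                 bool context_opts_given,
                                 uint32_t mounter_sid)
{
    if (!sb->userns_mount)
        return 0;
    if (context_opts_given)
        return -EACCES;
    if (sb->behavior == USE_XATTR) {
        sb->behavior = USE_MNTPOINT;
        sb->mntpoint_sid = sb->has_bdev ? sb->bdev_sid : mounter_sid;
    }
    return 0;
}
```

Only the xattr case is rewritten here, on the reasoning that filesystems which never take labels from on-disk xattrs have nothing untrusted to override.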
Post by Stephen Smalley
Post by Seth Forshee
---
security/selinux/hooks.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 459e71ddbc9d..eeb71e45ab82 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
!strcmp(sb->s_type->name, "pstore"))
sbsec->flags |= SE_SBGENFS;
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid)
+ return -EPERM;
+ sbsec->behavior = SECURITY_FS_USE_NONE;
+ goto out_set_opts;
+ }
+
+
if (!sbsec->behavior) {
/*
* Determine the labeling behavior to use for this
@@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}
rc = sb_finish_set_opts(sb);
mutex_unlock(&sbsec->lock);
Seth Forshee
2015-07-22 16:14:22 UTC
Post by Stephen Smalley
Post by Stephen Smalley
Post by Seth Forshee
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.
The right way to handle this for SELinux would be to automatically use
mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
specifying a context= mount option), with the sbsec->mntpoint_sid set
from some related object (e.g. the block device file context, as in your
patches for Smack). That will cause SELinux to use that value instead
of any xattr value from the filesystem and will cause attempts by
userspace to set the security.selinux xattr to fail on that filesystem.
That is how SELinux normally deals with untrusted filesystems, except
that it is normally specified as a mount option by a trusted mounting
process, whereas in your case you need to automatically set it.
Excellent, thank you for the advice. I'll start on this when I've
finished with Smack.

Seth
Stephen Smalley
2015-07-22 20:25:22 UTC
Post by Seth Forshee
Post by Stephen Smalley
Post by Stephen Smalley
Post by Seth Forshee
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.
The right way to handle this for SELinux would be to automatically use
mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
specifying a context= mount option), with the sbsec->mntpoint_sid set
from some related object (e.g. the block device file context, as in your
patches for Smack). That will cause SELinux to use that value instead
of any xattr value from the filesystem and will cause attempts by
userspace to set the security.selinux xattr to fail on that filesystem.
That is how SELinux normally deals with untrusted filesystems, except
that it is normally specified as a mount option by a trusted mounting
process, whereas in your case you need to automatically set it.
Excellent, thank you for the advice. I'll start on this when I've
finished with Smack.
Not tested, but something like this should work. Note that it should
come after the call to security_fs_use() so we know whether SELinux
would even try to use xattrs supplied by the filesystem in the first place.

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c..84da3a2 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ struct block_device *bdev = sb->s_bdev;
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ if (bdev) {
+ struct inode_security_struct *isec =
bdev->bd_inode;
+ sbsec->mntpoint_sid = isec->sid;
+ } else {
+ sbsec->mntpoint_sid = current_sid();
+ }
+ }
+ goto out_set_opts;
+ }
+
/* sets the context of the superblock for the fs being mounted. */
if (fscontext_sid) {
rc = may_context_mount_sb_relabel(fscontext_sid, sbsec,
cred);
@@ -813,6 +837,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
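For readers skimming the diff, the decision it adds can be modelled in user space. This is only a sketch under stated assumptions: the `toy_*` types, the `FS_USE_*` enum values, and the `bdev_sid` field are hypothetical stand-ins for `struct super_block`, `sbsec->behavior`, and "some label derived from the backing device"; none of them are kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel's labeling behaviors (values are arbitrary). */
enum fs_use { FS_USE_XATTR, FS_USE_MNTPOINT, FS_USE_GENFS };

/* Hypothetical, simplified view of the state the patch consults. */
struct toy_sb {
    bool in_init_user_ns;  /* models sb->s_user_ns == &init_user_ns */
    bool has_bdev;         /* models sb->s_bdev != NULL */
    uint32_t bdev_sid;     /* assumed label derived from the backing device */
    enum fs_use behavior;  /* models sbsec->behavior */
    uint32_t mntpoint_sid; /* models sbsec->mntpoint_sid */
};

/* Models the four context mount options after conversion to SIDs. */
struct toy_opts {
    uint32_t context_sid, fscontext_sid, rootcontext_sid, defcontext_sid;
};

/*
 * Mirrors only the block added by the patch: returns 0 on success, or
 * -EACCES if a mount from a non-init user namespace supplies a context.
 */
int toy_userns_mount_policy(struct toy_sb *sb, const struct toy_opts *o,
                            uint32_t current_sid)
{
    assert(sb && o);

    if (sb->in_init_user_ns)
        return 0; /* privileged mounts take the usual path */

    if (o->context_sid || o->fscontext_sid || o->rootcontext_sid ||
        o->defcontext_sid)
        return -EACCES;

    /* Only override when xattrs from the filesystem would be trusted. */
    if (sb->behavior == FS_USE_XATTR) {
        sb->behavior = FS_USE_MNTPOINT;
        sb->mntpoint_sid = sb->has_bdev ? sb->bdev_sid : current_sid;
    }
    return 0;
}
```

Checking `behavior` only after it has been computed matches the note that the block must come after the call to security_fs_use().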
Stephen Smalley
2015-07-22 20:40:29 UTC
Post by Stephen Smalley
@@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ struct block_device *bdev = sb->s_bdev;
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ if (bdev) {
+ struct inode_security_struct *isec =
bdev->bd_inode;
That should be bdev->bd_inode->i_security.
_______________________________________________
Selinux mailing list
Stephen Smalley
2015-07-23 13:57:20 UTC
Post by Stephen Smalley
Post by Stephen Smalley
@@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ struct block_device *bdev = sb->s_bdev;
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ if (bdev) {
+ struct inode_security_struct *isec =
bdev->bd_inode;
That should be bdev->bd_inode->i_security.
Sorry, this won't work. bd_inode is not the inode of the block device
file that was passed to mount, and it isn't labeled in any way. It will
just be unlabeled.

So I guess the only real option here as a fallback is
sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
only case where we currently assign task labels to files is for their
/proc/pid inodes, and no current policy will therefore allow create
permission to such files.
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
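The distinction drawn in this message, between the inode of the device node passed to mount and the block device's internal bd_inode, can be illustrated with a toy user-space model. This is a sketch only: `toy_inode`, `toy_bdev`, `toy_devnode`, and the `TOY_SID_UNLABELED` value are hypothetical names invented for illustration, not kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SID_UNLABELED 3u /* hypothetical placeholder value */

struct toy_inode {
    uint32_t sid; /* security label attached to this inode */
};

/* Models a block device with its kernel-internal inode (bd_inode). */
struct toy_bdev {
    struct toy_inode internal; /* never labeled by policy in this model */
};

/* Models a device node such as the path handed to mount. */
struct toy_devnode {
    struct toy_inode inode; /* lives on a filesystem, labeled by policy */
    struct toy_bdev *bdev;  /* the device the node refers to */
};

/* What the first draft of the patch would have read. */
uint32_t toy_sid_via_bd_inode(const struct toy_devnode *node)
{
    assert(node != NULL && node->bdev != NULL);
    return node->bdev->internal.sid;
}

/* What the thread actually wants: the label of the device file itself. */
uint32_t toy_sid_via_devnode(const struct toy_devnode *node)
{
    assert(node != NULL);
    return node->inode.sid;
}
```

In this model, going through the block device always yields the unlabeled value, whatever label the device node carries.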
Seth Forshee
2015-07-23 14:39:20 UTC
Post by Stephen Smalley
Sorry, this won't work. bd_inode is not the inode of the block device
file that was passed to mount, and it isn't labeled in any way. It will
just be unlabeled.
So I guess the only real option here as a fallback is
sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
only case where we currently assign task labels to files is for their
/proc/pid inodes, and no current policy will therefore allow create
permission to such files.
Darn, you're right, that isn't the inode we want. There really doesn't
seem to be any way to get back to the one we want from the LSM, short of
adding a new hook.

Seth
Stephen Smalley
2015-07-23 15:36:03 UTC
Post by Seth Forshee
Post by Stephen Smalley
Sorry, this won't work. bd_inode is not the inode of the block device
file that was passed to mount, and it isn't labeled in any way. It will
just be unlabeled.
So I guess the only real option here as a fallback is
sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
only case where we currently assign task labels to files is for their
/proc/pid inodes, and no current policy will therefore allow create
permission to such files.
Darn, you're right, that isn't the inode we want. There really doesn't
seem to be any way to get back to the one we want from the LSM, short of
adding a new hook.
Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
Feels like a layering violation though...
Seth Forshee
2015-07-23 16:23:31 UTC
Post by Stephen Smalley
Post by Seth Forshee
Darn, you're right, that isn't the inode we want. There really doesn't
seem to be any way to get back to the one we want from the LSM, short of
adding a new hook.
Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
Feels like a layering violation though...
Yeah, and even though that probably works out to be the inode we want in
most cases I don't think we can be absolutely certain that it is. Maybe
there's some way we could walk the list and be sure we've found the
right inode, but I'm not seeing it.
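The concern about taking the first entry can be made concrete with a small user-space model. It assumes, for illustration, that more than one device node refers to the same block device and that the nodes carry different labels; `toy_node` and both helper functions are hypothetical and only mimic the shape of the bd_inodes list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models one device-node inode linked on a block device's inode list. */
struct toy_node {
    const char *path;      /* illustrative only */
    uint32_t sid;          /* label of this device node */
    struct toy_node *next; /* models the i_devices linkage */
};

/* Picks the head of the list, as list_first_entry() would. */
const struct toy_node *toy_first_node(const struct toy_node *head)
{
    assert(head != NULL);
    return head;
}

/* Returns 1 if every node on the list carries the same label, else 0. */
int toy_labels_agree(const struct toy_node *head)
{
    for (const struct toy_node *n = head; n && n->next; n = n->next)
        if (n->sid != n->next->sid)
            return 0;
    return 1;
}
```

If the labels disagree, the list head tells the security module nothing about which node was actually passed to mount, which is the uncertainty described above.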
Seth Forshee
2015-07-24 15:11:37 UTC
Post by Seth Forshee
Post by Stephen Smalley
Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
Feels like a layering violation though...
Yeah, and even though that probably works out to be the inode we want in
most cases I don't think we can be absolutely certain that it is. Maybe
there's some way we could walk the list and be sure we've found the
right inode, but I'm not seeing it.
I guess we could do something like this (note that most of the changes
here are just to give a version of blkdev_get_by_path which takes a
struct path * so that the filename lookup doesn't have to be done
twice). Basically add a new hook that informs the security module of the
inode for the backing device file passed to mount and call that from
mount_bdev. The security module could grab a reference to the inode and
stash it away.

Something else to note is that, as I have it here, the hook would end up
getting called for every mount of a given block device, not just the
first. So it's possible the security module could see the hook called a
second time with a different inode that has a different label. The hook
could be changed to return int if you wanted to have the opportunity to
reject such mounts.

Seth

---

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f8ce371c437c..dc2173e24e30 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1372,14 +1372,39 @@ int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
}
EXPORT_SYMBOL(blkdev_get);

+static struct block_device *__lookup_bdev(struct path *path);
+
+struct block_device * __blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder)
+{
+ struct block_device *bdev;
+ int err;
+
+ bdev = __lookup_bdev(path);
+ if (IS_ERR(bdev))
+ return bdev;
+
+ err = blkdev_get(bdev, mode, holder);
+ if (err)
+ return ERR_PTR(err);
+
+ if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
+ blkdev_put(bdev, mode);
+ return ERR_PTR(-EACCES);
+ }
+
+ return bdev;
+}
+EXPORT_SYMBOL(__blkdev_get_by_path);
+
/**
* blkdev_get_by_path - open a block device by name
- * @path: path to the block device to open
+ * @pathname: path to the block device to open
* @mode: FMODE_* mask
* @holder: exclusive holder identifier
*
- * Open the blockdevice described by the device file at @path. @mode
- * and @holder are identical to blkdev_get().
+ * Open the blockdevice described by the device file at @pathname.
+ * @mode and @holder are identical to blkdev_get().
*
* On success, the returned block_device has reference count of one.
*
@@ -1389,25 +1414,22 @@ EXPORT_SYMBOL(blkdev_get);
* RETURNS:
* Pointer to block_device on success, ERR_PTR(-errno) on failure.
*/
-struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
+struct block_device *blkdev_get_by_path(const char *pathname, fmode_t mode,
void *holder)
{
struct block_device *bdev;
- int err;
-
- bdev = lookup_bdev(path);
- if (IS_ERR(bdev))
- return bdev;
+ struct path path;
+ int error;

- err = blkdev_get(bdev, mode, holder);
- if (err)
- return ERR_PTR(err);
+ if (!pathname || !*pathname)
+ return ERR_PTR(-EINVAL);

- if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
- blkdev_put(bdev, mode);
- return ERR_PTR(-EACCES);
- }
+ error = kern_path(pathname, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);

+ bdev = __blkdev_get_by_path(&path, mode, holder);
+ path_put(&path);
return bdev;
}
EXPORT_SYMBOL(blkdev_get_by_path);
@@ -1702,6 +1724,30 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)

EXPORT_SYMBOL(ioctl_by_bdev);

+static struct block_device *__lookup_bdev(struct path *path)
+{
+ struct block_device *bdev;
+ struct inode *inode;
+ int error;
+
+ inode = d_backing_inode(path->dentry);
+ error = -ENOTBLK;
+ if (!S_ISBLK(inode->i_mode))
+ goto fail;
+ error = -EACCES;
+ if (!may_open_dev(path))
+ goto fail;
+ error = -ENOMEM;
+ bdev = bd_acquire(inode);
+ if (!bdev)
+ goto fail;
+out:
+ return bdev;
+fail:
+ bdev = ERR_PTR(error);
+ goto out;
+}
+
/**
* lookup_bdev - lookup a struct block_device by name
* @pathname: special file representing the block device
@@ -1713,7 +1759,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
struct block_device *lookup_bdev(const char *pathname)
{
struct block_device *bdev;
- struct inode *inode;
struct path path;
int error;

@@ -1724,23 +1769,9 @@ struct block_device *lookup_bdev(const char *pathname)
if (error)
return ERR_PTR(error);

- inode = d_backing_inode(path.dentry);
- error = -ENOTBLK;
- if (!S_ISBLK(inode->i_mode))
- goto fail;
- error = -EACCES;
- if (!may_open_dev(&path))
- goto fail;
- error = -ENOMEM;
- bdev = bd_acquire(inode);
- if (!bdev)
- goto fail;
-out:
+ bdev = __lookup_bdev(&path);
path_put(&path);
return bdev;
-fail:
- bdev = ERR_PTR(error);
- goto out;
}
EXPORT_SYMBOL(lookup_bdev);

diff --git a/fs/super.c b/fs/super.c
index 008f938e3ec0..558f7845a171 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <linux/namei.h>
#include "internal.h"


@@ -980,15 +981,26 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
{
struct block_device *bdev;
struct super_block *s;
+ struct path path;
+ struct inode *inode;
fmode_t mode = FMODE_READ | FMODE_EXCL;
int error = 0;

if (!(flags & MS_RDONLY))
mode |= FMODE_WRITE;

- bdev = blkdev_get_by_path(dev_name, mode, fs_type);
- if (IS_ERR(bdev))
- return ERR_CAST(bdev);
+ if (!dev_name || !*dev_name)
+ return ERR_PTR(-EINVAL);
+
+ error = kern_path(dev_name, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);
+
+ bdev = __blkdev_get_by_path(&path, mode, fs_type);
+ if (IS_ERR(bdev)) {
+ error = PTR_ERR(bdev);
+ goto error;
+ }

/*
* once the super is inserted into the list by sget, s_umount
@@ -1040,6 +1052,10 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
bdev->bd_super = s;
}

+ inode = d_backing_inode(path.dentry);
+ security_sb_backing_dev(s, inode);
+ path_put(&path);
+
return dget(s->s_root);

error_s:
@@ -1047,6 +1063,7 @@ error_s:
error_bdev:
blkdev_put(bdev, mode);
error:
+ path_put(&path);
return ERR_PTR(error);
}
EXPORT_SYMBOL(mount_bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4597420ab933..3748945bf0d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2315,6 +2315,8 @@ extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder);
+extern struct block_device *__blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder);
extern struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
void *holder);
extern struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9429f054c323..52ce1a094e04 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1351,6 +1351,7 @@ union security_list_options {
int (*sb_clone_mnt_opts)(const struct super_block *oldsb,
struct super_block *newsb);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ void (*sb_backing_dev)(struct super_block *sb, struct inode *inode);
int (*dentry_init_security)(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1648,6 +1649,7 @@ struct security_hook_heads {
struct list_head sb_set_mnt_opts;
struct list_head sb_clone_mnt_opts;
struct list_head sb_parse_opts_str;
+ struct list_head sb_backing_dev;
struct list_head dentry_init_security;
#ifdef CONFIG_SECURITY_PATH
struct list_head path_unlink;
diff --git a/include/linux/security.h b/include/linux/security.h
index 79d85ddf8093..7a4d8382af20 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -231,6 +231,7 @@ int security_sb_set_mnt_opts(struct super_block *sb,
int security_sb_clone_mnt_opts(const struct super_block *oldsb,
struct super_block *newsb);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+void security_sb_backing_dev(struct super_block *sb, struct inode *inode);
int security_dentry_init_security(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -562,6 +563,10 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}

+static inline void security_sb_backing_dev(struct super_block *sb,
+ struct inode *inode)
+{ }
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..f6f89e0f06d8 100644
--- a/security/security.c
+++ b/security/security.c
@@ -347,6 +347,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);

+void security_sb_backing_dev(struct super_block *sb, struct inode *inode)
+{
+ call_void_hook(sb_backing_dev, sb, inode);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;
@@ -1595,6 +1600,8 @@ struct security_hook_heads security_hook_heads = {
LIST_HEAD_INIT(security_hook_heads.sb_clone_mnt_opts),
.sb_parse_opts_str =
LIST_HEAD_INIT(security_hook_heads.sb_parse_opts_str),
+ .sb_backing_dev =
+ LIST_HEAD_INIT(security_hook_heads.sb_backing_dev),
.dentry_init_security =
LIST_HEAD_INIT(security_hook_heads.dentry_init_security),
#ifdef CONFIG_SECURITY_PATH
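The patch above only adds the hook; it includes no security-module implementation. As a sketch of the variant mentioned in the cover text, where the hook returns int so that a later mount with a differently labeled device node can be rejected, a module-side handler might look roughly like the following user-space model. The `toy_sb_security` struct, the function name, and the choice of -EPERM are assumptions for illustration, not part of the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-superblock state kept by a security module. */
struct toy_sb_security {
    bool have_backing_sid; /* set once the first mount has been seen */
    uint32_t backing_sid;  /* label of the device node from that mount */
};

/*
 * Models an int-returning sb_backing_dev hook: remember the label seen
 * on the first mount, accept later mounts whose device node carries the
 * same label, and reject any that differ.
 */
int toy_sb_backing_dev(struct toy_sb_security *sbsec, uint32_t inode_sid)
{
    assert(sbsec != NULL);

    if (!sbsec->have_backing_sid) {
        sbsec->backing_sid = inode_sid;
        sbsec->have_backing_sid = true;
        return 0;
    }
    return sbsec->backing_sid == inode_sid ? 0 : -EPERM;
}
```

The model stashes only the label; the posted description suggests a module could instead hold a reference to the inode itself, which would need matching release logic at unmount.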
Stephen Smalley
2015-07-30 15:57:24 UTC
Post by Seth Forshee
I guess we could do something like this (note that most of the changes
here are just to give a version of blkdev_get_by_path which takes a
struct path * so that the filename lookup doesn't have to be done
twice). Basically add a new hook that informs the security module of the
inode for the backing device file passed to mount and call that from
mount_bdev. The security module could grab a reference to the inode and
stash it away.
Something else to note is that, as I have it here, the hook would end up
getting called for every mount of a given block device, not just the
first. So it's possible the security module could see the hook called a
second time with a different inode that has a different label. The hook
could be changed to return int if you wanted to have the opportunity to
reject such mounts.
I'm not comfortable with this approach due to the aliasing/ambiguity you
mention, as well as being unsure as to whether we truly want to label it
the same as the backing block device (we certainly do not do that for
normal mounts). Was also expecting the vfs folks to veto this patch but
haven't seen that yet.

For now, how about if we just do this to compute the mountpoint label
for SELinux:
rc = security_transition_sid(current_sid(), current_sid(),
SECCLASS_FILE, NULL, &sbsec->mntpoint_sid);
if (rc)
goto out;

This will turn the current task context into a form suitable for a file
object, while simultaneously allowing the policy writer to specify a
different label for the files through policy transition rules if desired.
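
To make the suggestion above concrete, here is a small userspace model (not kernel code) of what passing the task context as both source and target is expected to yield for a file-class object. It assumes the usual defaults as I understand them: user taken from the source, role forced to object_r, type taken from the target unless a policy transition rule overrides it. The struct layout, the helper name model_transition_file and the type names used with it are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy security context; real SELinux contexts also carry an MLS range. */
struct ctx {
	char user[32];
	char role[32];
	char type[32];
};

/* Hypothetical type_transition rule: (source type, target type) -> new type. */
struct trans_rule {
	const char *stype;
	const char *ttype;
	const char *newtype;
};

/*
 * Model of deriving a file label from a task context, with the task
 * context used as both source and target.  Assumed defaults for a
 * file-class object: user from the source, role object_r, type from
 * the target unless a rule overrides it.
 */
static void model_transition_file(const struct ctx *task,
				  const struct trans_rule *rules, size_t nrules,
				  struct ctx *out)
{
	size_t i;

	snprintf(out->user, sizeof(out->user), "%s", task->user);
	snprintf(out->role, sizeof(out->role), "%s", "object_r");
	snprintf(out->type, sizeof(out->type), "%s", task->type);

	/* A matching rule lets the policy writer pick a different file type. */
	for (i = 0; i < nrules; i++) {
		if (!strcmp(rules[i].stype, task->type) &&
		    !strcmp(rules[i].ttype, task->type)) {
			snprintf(out->type, sizeof(out->type), "%s",
				 rules[i].newtype);
			break;
		}
	}
}
```

With no rule the files would simply carry the mounting task's type under object_r; a single rule in policy is enough to redirect them to a dedicated file type.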
Post by Seth Forshee
Seth
---
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f8ce371c437c..dc2173e24e30 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1372,14 +1372,39 @@ int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
}
EXPORT_SYMBOL(blkdev_get);
+static struct block_device *__lookup_bdev(struct path *path);
+
+struct block_device * __blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder)
+{
+ struct block_device *bdev;
+ int err;
+
+ bdev = __lookup_bdev(path);
+ if (IS_ERR(bdev))
+ return bdev;
+
+ err = blkdev_get(bdev, mode, holder);
+ if (err)
+ return ERR_PTR(err);
+
+ if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
+ blkdev_put(bdev, mode);
+ return ERR_PTR(-EACCES);
+ }
+
+ return bdev;
+}
+EXPORT_SYMBOL(__blkdev_get_by_path);
+
/**
* blkdev_get_by_path - open a block device by name
*
*
* On success, the returned block_device has reference count of one.
*
@@ -1389,25 +1414,22 @@ EXPORT_SYMBOL(blkdev_get);
* Pointer to block_device on success, ERR_PTR(-errno) on failure.
*/
-struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
+struct block_device *blkdev_get_by_path(const char *pathname, fmode_t mode,
void *holder)
{
struct block_device *bdev;
- int err;
-
- bdev = lookup_bdev(path);
- if (IS_ERR(bdev))
- return bdev;
+ struct path path;
+ int error;
- err = blkdev_get(bdev, mode, holder);
- if (err)
- return ERR_PTR(err);
+ if (!pathname || !*pathname)
+ return ERR_PTR(-EINVAL);
- if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
- blkdev_put(bdev, mode);
- return ERR_PTR(-EACCES);
- }
+ error = kern_path(pathname, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);
+ bdev = __blkdev_get_by_path(&path, mode, holder);
+ path_put(&path);
return bdev;
}
EXPORT_SYMBOL(blkdev_get_by_path);
@@ -1702,6 +1724,30 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
EXPORT_SYMBOL(ioctl_by_bdev);
+static struct block_device *__lookup_bdev(struct path *path)
+{
+ struct block_device *bdev;
+ struct inode *inode;
+ int error;
+
+ inode = d_backing_inode(path->dentry);
+ error = -ENOTBLK;
+ if (!S_ISBLK(inode->i_mode))
+ goto fail;
+ error = -EACCES;
+ if (!may_open_dev(path))
+ goto fail;
+ error = -ENOMEM;
+ bdev = bd_acquire(inode);
+ if (!bdev)
+ goto fail;
+out:
+ return bdev;
+fail:
+ bdev = ERR_PTR(error);
+ goto out;
+}
+
/**
* lookup_bdev - lookup a struct block_device by name
@@ -1713,7 +1759,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
struct block_device *lookup_bdev(const char *pathname)
{
struct block_device *bdev;
- struct inode *inode;
struct path path;
int error;
@@ -1724,23 +1769,9 @@ struct block_device *lookup_bdev(const char *pathname)
if (error)
return ERR_PTR(error);
- inode = d_backing_inode(path.dentry);
- error = -ENOTBLK;
- if (!S_ISBLK(inode->i_mode))
- goto fail;
- error = -EACCES;
- if (!may_open_dev(&path))
- goto fail;
- error = -ENOMEM;
- bdev = bd_acquire(inode);
- if (!bdev)
- goto fail;
-out:
+ bdev = __lookup_bdev(&path);
path_put(&path);
return bdev;
-fail:
- bdev = ERR_PTR(error);
- goto out;
}
EXPORT_SYMBOL(lookup_bdev);
diff --git a/fs/super.c b/fs/super.c
index 008f938e3ec0..558f7845a171 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <linux/namei.h>
#include "internal.h"
@@ -980,15 +981,26 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
{
struct block_device *bdev;
struct super_block *s;
+ struct path path;
+ struct inode *inode;
fmode_t mode = FMODE_READ | FMODE_EXCL;
int error = 0;
if (!(flags & MS_RDONLY))
mode |= FMODE_WRITE;
- bdev = blkdev_get_by_path(dev_name, mode, fs_type);
- if (IS_ERR(bdev))
- return ERR_CAST(bdev);
+ if (!dev_name || !*dev_name)
+ return ERR_PTR(-EINVAL);
+
+ error = kern_path(dev_name, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);
+
+ bdev = __blkdev_get_by_path(&path, mode, fs_type);
+ if (IS_ERR(bdev)) {
+ error = PTR_ERR(bdev);
+ goto error;
+ }
/*
* once the super is inserted into the list by sget, s_umount
@@ -1040,6 +1052,10 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
bdev->bd_super = s;
}
+ inode = d_backing_inode(path.dentry);
+ security_sb_backing_dev(s, inode);
+ path_put(&path);
+
return dget(s->s_root);
error_bdev:
blkdev_put(bdev, mode);
error:
+ path_put(&path);
return ERR_PTR(error);
}
EXPORT_SYMBOL(mount_bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4597420ab933..3748945bf0d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2315,6 +2315,8 @@ extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder);
+extern struct block_device *__blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder);
extern struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
void *holder);
extern struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9429f054c323..52ce1a094e04 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1351,6 +1351,7 @@ union security_list_options {
int (*sb_clone_mnt_opts)(const struct super_block *oldsb,
struct super_block *newsb);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ void (*sb_backing_dev)(struct super_block *sb, struct inode *inode);
int (*dentry_init_security)(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1648,6 +1649,7 @@ struct security_hook_heads {
struct list_head sb_set_mnt_opts;
struct list_head sb_clone_mnt_opts;
struct list_head sb_parse_opts_str;
+ struct list_head sb_backing_dev;
struct list_head dentry_init_security;
#ifdef CONFIG_SECURITY_PATH
struct list_head path_unlink;
diff --git a/include/linux/security.h b/include/linux/security.h
index 79d85ddf8093..7a4d8382af20 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -231,6 +231,7 @@ int security_sb_set_mnt_opts(struct super_block *sb,
int security_sb_clone_mnt_opts(const struct super_block *oldsb,
struct super_block *newsb);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+void security_sb_backing_dev(struct super_block *sb, struct inode *inode);
int security_dentry_init_security(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -562,6 +563,10 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}
+static inline void security_sb_backing_dev(struct super_block *sb,
+ struct inode *inode)
+{ }
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..f6f89e0f06d8 100644
--- a/security/security.c
+++ b/security/security.c
@@ -347,6 +347,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);
+void security_sb_backing_dev(struct super_block *sb, struct inode *inode)
+{
+ call_void_hook(sb_backing_dev, sb, inode);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;
@@ -1595,6 +1600,8 @@ struct security_hook_heads security_hook_heads = {
LIST_HEAD_INIT(security_hook_heads.sb_clone_mnt_opts),
.sb_parse_opts_str =
LIST_HEAD_INIT(security_hook_heads.sb_parse_opts_str),
+ .sb_backing_dev =
+ LIST_HEAD_INIT(security_hook_heads.sb_backing_dev),
.dentry_init_security =
LIST_HEAD_INIT(security_hook_heads.dentry_init_security),
#ifdef CONFIG_SECURITY_PATH
Seth Forshee
2015-07-30 16:24:13 UTC
Permalink
Post by Stephen Smalley
Post by Seth Forshee
Post by Seth Forshee
Post by Stephen Smalley
Post by Seth Forshee
Post by Stephen Smalley
Post by Stephen Smalley
Post by Stephen Smalley
Post by Seth Forshee
Post by Stephen Smalley
Post by Stephen Smalley
Post by Seth Forshee
Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.
The right way to handle this for SELinux would be to automatically use
mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
specifying a context= mount option), with the sbsec->mntpoint_sid set
from some related object (e.g. the block device file context, as in your
patches for Smack). That will cause SELinux to use that value instead
of any xattr value from the filesystem and will cause attempts by
userspace to set the security.selinux xattr to fail on that filesystem.
That is how SELinux normally deals with untrusted filesystems, except
that it is normally specified as a mount option by a trusted mounting
process, whereas in your case you need to automatically set it.
Excellent, thank you for the advice. I'll start on this when I've
finished with Smack.
Not tested, but something like this should work. Note that it should
come after the call to security_fs_use() so we know whether SELinux
would even try to use xattrs supplied by the filesystem in the first place.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c..84da3a2 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ struct block_device *bdev = sb->s_bdev;
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ if (bdev) {
+ struct inode_security_struct *isec =
bdev->bd_inode;
That should be bdev->bd_inode->i_security.
Sorry, this won't work. bd_inode is not the inode of the block device
file that was passed to mount, and it isn't labeled in any way. It will
just be unlabeled.
So I guess the only real option here as a fallback is
sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
only case where we currently assign task labels to files is for their
/proc/pid inodes, and no current policy will therefore allow create
permission to such files.
Darn, you're right, that isn't the inode we want. There really doesn't
seem to be any way to get back to the one we want from the LSM, short of
adding a new hook.
Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
Feels like a layering violation though...
Yeah, and even though that probably works out to be the inode we want in
most cases I don't think we can be absolutely certain that it is. Maybe
there's some way we could walk the list and be sure we've found the
right inode, but I'm not seeing it.
I guess we could do something like this (note that most of the changes
here are just to give a version of blkdev_get_by_path which takes a
struct path * so that the filename lookup doesn't have to be done
twice). Basically add a new hook that informs the security module of the
inode for the backing device file passed to mount and call that from
mount_bdev. The security module could grab a reference to the inode and
stash it away.
Something else to note is that, as I have it here, the hook would end up
getting called for every mount of a given block device, not just the
first. So it's possible the security module could see the hook called a
second time with a different inode that has a different label. The hook
could be changed to return int if you wanted to have the opportunity to
reject such mounts.
I'm not comfortable with this approach due to the aliasing/ambiguity you
mention, as well as being unsure as to whether we truly want to label it
the same as the backing block device (we certainly do not do that for
normal mounts). Was also expecting the vfs folks to veto this patch but
haven't seen that yet.
Yeah, I wasn't necessarily suggesting that this was a _good_ way to go,
only that I couldn't find a workable alternative.
Post by Stephen Smalley
For now, how about if we just do this to compute the mountpoint label
for SELinux:
rc = security_transition_sid(current_sid(), current_sid(),
SECCLASS_FILE, NULL, &sbsec->mntpoint_sid);
if (rc)
goto out;
This will turn the current task context into a form suitable for a file
object, while simultaneously allowing the policy writer to specify a
different label for the files through policy transition rules if desired.
Great, I'll incorporate this. Thanks!

Seth
Seth Forshee
2015-07-15 19:46:08 UTC
Permalink
Avoid use of untrusted security labels when s_user_ns !=
init_user_ns:
- smk_fetch: refuse to read labels from disk
- smack_inode_init_security: return -ENOTSUPP
- smack_d_instantiate: don't use security xattrs from disk

Signed-off-by: Seth Forshee <***@canonical.com>
---
security/smack/smack_lsm.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..6a849da94f47 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,9 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;

+ if (ip->i_sb->s_user_ns != &init_user_ns)
+ return NULL;
+
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);

@@ -833,6 +836,9 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
struct smack_known *dsp = smk_of_inode(dir);
int may;

+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -ENOTSUPP;
+
if (name)
*name = XATTR_SMACK_SUFFIX;

@@ -3176,11 +3182,13 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
}
/*
* No xattr support means, alas, no SMACK label.
- * Use the aforeapplied default.
+ * Use the aforeapplied default. Also don't use
+ * xattrs from userns mounts.
* It would be curious if the label of the task
* does not match that assigned.
*/
- if (inode->i_op->getxattr == NULL)
+ if (inode->i_sb->s_user_ns != &init_user_ns ||
+ inode->i_op->getxattr == NULL)
break;
/*
* Get the dentry for xattr.
--
1.9.1
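
For readers following along, the check this patch repeats in three places can be modelled in userspace roughly as below. This is only a sketch: the structures are minimal stand-ins rather than the real kernel definitions, and the disk_label field and the helper name model_fetch_label are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures; not the real definitions. */
struct user_namespace { int unused; };
struct super_block { struct user_namespace *s_user_ns; };
struct inode {
	struct super_block *i_sb;
	const char *disk_label;	/* hypothetical: label stored in an xattr */
};

/* Stand-in for the initial user namespace object. */
static struct user_namespace init_user_ns;

/* Returns the on-disk label only for superblocks owned by init_user_ns. */
static const char *model_fetch_label(const struct inode *ip)
{
	if (ip->i_sb->s_user_ns != &init_user_ns)
		return NULL;	/* userns mount: ignore the xattr entirely */
	return ip->disk_label;
}
```

The comparison is by address against the single init_user_ns object, so any superblock created from a child namespace takes the early return regardless of what is stored on disk.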
Casey Schaufler
2015-07-15 20:43:41 UTC
Permalink
Post by Seth Forshee
Avoid use of untrusted security labels when s_user_ns !=
init_user_ns:
- smk_fetch: refuse to read labels from disk
- smack_inode_init_security: return -ENOTSUPP
- smack_d_instantiate: don't use security xattrs from disk
I do not like this at all at all. Pretending that Smack
doesn't exist in a user namespace can lead to all sorts
of blatant security violations, both while the filesystem
is mounted in the namespace and in the init namespace.
Post by Seth Forshee
---
security/smack/smack_lsm.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..6a849da94f47 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,9 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;
+ if (ip->i_sb->s_user_ns != &init_user_ns)
+ return NULL;
+
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);
@@ -833,6 +836,9 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
struct smack_known *dsp = smk_of_inode(dir);
int may;
+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -ENOTSUPP;
+
if (name)
*name = XATTR_SMACK_SUFFIX;
@@ -3176,11 +3182,13 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
}
/*
* No xattr support means, alas, no SMACK label.
- * Use the aforeapplied default.
+ * Use the aforeapplied default. Also don't use
+ * xattrs from userns mounts.
* It would be curious if the label of the task
* does not match that assigned.
*/
- if (inode->i_op->getxattr == NULL)
+ if (inode->i_sb->s_user_ns != &init_user_ns ||
+ inode->i_op->getxattr == NULL)
break;
/*
* Get the dentry for xattr.
Amir Goldstein
2015-07-30 04:24:11 UTC
Permalink
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,

There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev

While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.

A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images

This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.

Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.

Any thoughts on how to reconcile this conflict?

Amir.
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
---
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
skp = smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
+ if (sb_in_userns(sb))
+ transmute = 1;
}
/*
* Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
if (mask == 0)
return 0;
+ if (sb_in_userns(inode->i_sb)) {
+ struct superblock_smack *sbsp = inode->i_sb->s_security;
+ if (smk_of_inode(inode) != sbsp->smk_root)
+ return -EACCES;
+ }
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- /*
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
+ if (!sb_in_userns(inode->i_sb)) {
+ /*
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
+ }
skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
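
The smack_inode_permission change in the quoted patch can be modelled in userspace roughly as follows. This is a sketch under simplifying assumptions: the *_model structures and the function name are hypothetical stand-ins, labels are compared by pointer as in the patch, and the normal Smack access check that follows in the real hook is reduced to returning 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a Smack label; identity is the pointer, not the name. */
struct smack_known { const char *name; };

/* Hypothetical reduced superblock: userns flag plus the root label. */
struct sb_model {
	bool in_userns;
	const struct smack_known *smk_root;
};

struct inode_model {
	const struct sb_model *sb;
	const struct smack_known *label;
};

static int model_inode_permission(const struct inode_model *inode, int mask)
{
	/* An empty request is always allowed. */
	if (mask == 0)
		return 0;

	/* On a userns mount only inodes carrying the root label are usable. */
	if (inode->sb->in_userns && inode->label != inode->sb->smk_root)
		return -EACCES;

	return 0;	/* the real hook goes on to the normal Smack check */
}
```

This corresponds to points a and b in the quoted summary: files with the superblock's label stay accessible, anything else is refused before the regular rule lookup runs.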
Seth Forshee
2015-07-30 13:55:31 UTC
Permalink
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?

That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.

Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?

Seth
Amir Goldstein
2015-07-30 14:47:02 UTC
Permalink
On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
Post by Seth Forshee
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?
Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where an unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
Post by Seth Forshee
That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.
That use case will not be addressed by the policy I suggested,
but the more common case of:
- create a loopback file
- mkfs
- mount
will be addressed.

So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.
Post by Seth Forshee
Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?
Seth
Casey Schaufler
2015-07-30 15:33:27 UTC
Permalink
Post by Amir Goldstein
On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
Post by Seth Forshee
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?
Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where an unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
You could do this on a Smack based system. It would require
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
to set some SMACK64EXEC labels on your FileSystemTools, and
they would have to be written as carefully as they would if they
had "more" privilege. You'd need to designate a repository for
your loopback files. On the whole, it would be unattractive.
I will pass on providing the details for fear someone will like
it well enough to implement.
Post by Amir Goldstein
Post by Seth Forshee
That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.
That use case will not be addressed by the policy I suggested, but the sequence
- create a loopback file
- mkfs
- mount
will be addressed.
So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.
But that sort of defeats the purpose of unprivileged mounts.
Or rather, you're trying to place restrictions on what an
unprivileged user can do without calling the ability to
violate those restrictions "privilege".
Post by Amir Goldstein
Post by Seth Forshee
Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?
Seth
Colin Walters
2015-07-30 15:52:07 UTC
It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:

http://libguestfs.org/

Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio. You can use
the API, or `guestmount` exposes it via FUSE.

It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
"unprivileged" VM. I've used it for several projects and been
quite happy.
Eric W. Biederman
2015-07-30 16:15:48 UTC
Post by Colin Walters
It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:
http://libguestfs.org/
Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio. You can use
the API, or `guestmount` exposes it via FUSE.
It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
"unprivileged" VM. I've used it for several projects and been
quite happy.
Thanks for pointing this out. That makes it clear we only have to get
as far as making fuse work for this work to be useful in practice.

Eric
Serge Hallyn
2015-07-30 13:57:10 UTC
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
Just to be sure I understand right, you're looking for a way to let
the host admin trust that the kernel's superblock parsers aren't being
fed trash or an exploit?
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
Amir.
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
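The rule in 2.a and 2.b can be sketched in user space roughly as follows. This is only an illustrative model, not the kernel code: the struct and function names (model_sb, model_inode, model_inode_permission) are hypothetical, and labels are compared as strings here, whereas the patch in this thread compares label pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; these are not the kernel structures. */
struct model_sb {
	bool in_userns;        /* stands in for sb_in_userns(sb) */
	const char *smk_root;  /* label assigned from the backing device */
};

struct model_inode {
	const struct model_sb *sb;
	const char *label;     /* label found on the inode */
};

/*
 * Return 0 when the user-namespace pre-check lets the access go on to
 * the normal rule evaluation, or -EACCES when the inode's label differs
 * from the superblock's root label on a user-namespace mount.
 */
int model_inode_permission(const struct model_inode *inode)
{
	assert(inode != NULL && inode->sb != NULL);

	if (inode->sb->in_userns &&
	    strcmp(inode->label, inode->sb->smk_root) != 0)
		return -EACCES;

	return 0;
}
```

In this model a file labeled the same as the backing device passes the pre-check, a file with any other label is rejected, and mounts outside a user namespace are not affected by the pre-check at all.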
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
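A minimal sketch of that idea, as a user-space model rather than the real smack_d_instantiate: the names model_inode_smack and model_set_exec_label are hypothetical, and the sketch assumes the caller initialised smk_task to NULL before instantiation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of per-inode Smack state; not the kernel struct. */
struct model_inode_smack {
	const char *smk_task;  /* label a task would take on exec, or NULL */
};

/*
 * xattr_exec_label is whatever SMACK64EXEC value the filesystem image
 * carries (or NULL). For a user-namespace mount the value is never
 * consulted, so smk_task keeps its initial value.
 */
void model_set_exec_label(struct model_inode_smack *isp, bool sb_in_userns,
			  const char *xattr_exec_label)
{
	assert(isp != NULL);

	if (!sb_in_userns)
		isp->smk_task = xattr_exec_label;
}
```

With this shape, an exec label stored in an image mounted from a user namespace has no effect on the task that executes the file, while mounts from the initial namespace behave as before.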
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
---
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		skp = smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
+		if (sb_in_userns(sb))
+			transmute = 1;
 	}
 	/*
 	 * Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
 	if (mask == 0)
 		return 0;
+	if (sb_in_userns(inode->i_sb)) {
+		struct superblock_smack *sbsp = inode->i_sb->s_security;
+		if (smk_of_inode(inode) != sbsp->smk_root)
+			return -EACCES;
+	}
+
 	/* May be droppable after audit */
 	if (no_block)
 		return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			if (rc >= 0)
 				transflag = SMK_INODE_TRANSMUTE;
 		}
-		/*
-		 */
-		skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
-		if (IS_ERR(skp) || skp == &smack_known_star ||
-		    skp == &smack_known_web)
-			skp = NULL;
-		isp->smk_task = skp;
+		if (!sb_in_userns(inode->i_sb)) {
+			/*
+			 */
+			skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+			if (IS_ERR(skp) || skp == &smack_known_star ||
+			    skp == &smack_known_web)
+				skp = NULL;
+			isp->smk_task = skp;
+		}
 		skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
 		if (IS_ERR(skp) || skp == &smack_known_star ||
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Amir Goldstein
2015-07-30 15:09:10 UTC
Post by Serge Hallyn
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
Just to be sure I understand right, you're looking for a way to let
the host admin trust that the kernel's superblock parsers aren't being
fed trash or an exploit?
Correct.
I do not believe in the direction of auditing file system code to a
vulnerability-free level, nor do I think that cryptographically signed
file system metadata is the only way to ensure an exploit-free
unprivileged mount.
Post by Serge Hallyn
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
Amir.
Post by Seth Forshee
Post by Casey Schaufler
Post by Seth Forshee
2. s_root is assigned the transmute property.
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
That's right. Accept correct data, reject anything that's not right.
Post by Seth Forshee
If this is right, there are a couple lingering questions in my mind.
First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.
Yes.
Post by Seth Forshee
The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.
Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
Post by Seth Forshee
So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.
In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".
Post by Seth Forshee
Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.
My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.
All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?
What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
---
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		skp = smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
+		if (sb_in_userns(sb))
+			transmute = 1;
 	}
 	/*
 	 * Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
 	if (mask == 0)
 		return 0;
+	if (sb_in_userns(inode->i_sb)) {
+		struct superblock_smack *sbsp = inode->i_sb->s_security;
+		if (smk_of_inode(inode) != sbsp->smk_root)
+			return -EACCES;
+	}
+
 	/* May be droppable after audit */
 	if (no_block)
 		return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			if (rc >= 0)
 				transflag = SMK_INODE_TRANSMUTE;
 		}
-		/*
-		 */
-		skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
-		if (IS_ERR(skp) || skp == &smack_known_star ||
-		    skp == &smack_known_web)
-			skp = NULL;
-		isp->smk_task = skp;
+		if (!sb_in_userns(inode->i_sb)) {
+			/*
+			 */
+			skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+			if (IS_ERR(skp) || skp == &smack_known_star ||
+			    skp == &smack_known_web)
+				skp = NULL;
+			isp->smk_task = skp;
+		}
 		skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
 		if (IS_ERR(skp) || skp == &smack_known_star ||
--
Amir Goldstein
2015-07-31 08:11:13 UTC
Post by Casey Schaufler
Post by Amir Goldstein
On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
Post by Seth Forshee
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?
Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
You could do this on a Smack based system. It would require
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
to set some SMACK64EXEC labels on your FileSystemTools, and
they would have to be written as carefully as they would if they
had "more" privilege. You'd need to designate a repository for
your loopback files. On the whole, it would be unattractive.
I will pass on providing the details for fear someone will like
it well enough to implement.
Post by Amir Goldstein
Post by Seth Forshee
That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.
That use case will not be addressed by the policy I suggested, but the sequence
- create a loopback file
- mkfs
- mount
will be addressed.
So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.
But that sort of defeats the purpose of unprivileged mounts.
Or rather, you're trying to place restrictions on what an
unprivileged user can do without calling the ability to
violate those restrictions "privilege".
I don't understand your concern.
I am saying that LSM can come to the rescue in a use case that
many have considered unsolvable (i.e. the loopback tampering).

Yes, I am trying to place restrictions on what an unprivileged user can do.
As it stands right now, the user is about to gain the ability to mount FUSE.
With some extra care in crafting the policy and without any extra code,
the user can gain the ability to mount "trusted loopback files".
It does not solve all use cases, but it does solve a handful.
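
To make the shape of such a policy concrete, here is a toy model of the one rule it hinges on: only tasks labeled 'FileSystemTools' may write 'Loopback' labeled files. This is a sketch under stated assumptions, not Smack's rule engine; the table layout and the names model_rule, model_policy and model_may_access are hypothetical, and anything without a matching rule is simply denied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One toy rule: tasks with `subject` get `access` on files with `object`. */
struct model_rule {
	const char *subject;
	const char *object;
	const char *access;   /* access letters, e.g. "rw" */
};

/* Hypothetical policy: FileSystemTools may read and write Loopback images. */
static const struct model_rule model_policy[] = {
	{ "FileSystemTools", "Loopback", "rw" },
};

/* Return true only if some rule grants `want` to subject on object. */
bool model_may_access(const char *subject, const char *object, char want)
{
	size_t i;

	assert(subject != NULL && object != NULL);

	if (want == '\0')
		return false;

	for (i = 0; i < sizeof(model_policy) / sizeof(model_policy[0]); i++) {
		const struct model_rule *r = &model_policy[i];

		if (strcmp(r->subject, subject) == 0 &&
		    strcmp(r->object, object) == 0)
			return strchr(r->access, want) != NULL;
	}
	return false;  /* no matching rule: deny */
}
```

Under this model mkfs and fsck running as 'FileSystemTools' can write the image, while a task with any other label cannot modify it before mounting.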

Anyway, the concern I was raising was about the fact that if files inside
the loopback mount inherit the label of the loopback file, this policy is
going to be impossible to write.
But Stephen has already proposed an alternative to this implicit inherit rule
on the [PATCH 6/7] thread, so I withdraw my concern.
Post by Casey Schaufler
Post by Amir Goldstein
Post by Seth Forshee
Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?
Seth
Casey Schaufler
2015-07-31 19:56:59 UTC
Post by Amir Goldstein
Post by Casey Schaufler
Post by Amir Goldstein
On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
Post by Seth Forshee
Post by Amir Goldstein
On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
Post by Seth Forshee
1. smk_root and smk_default are assigned the label of the backing
device.
Seth,
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev
While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images
This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.
You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?
Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
You could do this on a Smack based system. It would require
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
to set some SMACK64EXEC labels on your FileSystemTools, and
they would have to be written as carefully as they would if they
had "more" privilege. You'd need to designate a repository for
your loopback files. On the whole, it would be unattractive.
I will pass on providing the details for fear someone will like
it well enough to implement.
Post by Amir Goldstein
Post by Seth Forshee
That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.
That use case will not be addressed by the policy I suggested, but the sequence
- create a loopback file
- mkfs
- mount
will be addressed.
So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.
But that sort of defeats the purpose of unprivileged mounts.
Or rather, you're trying to place restrictions on what an
unprivileged user can do without calling the ability to
violate those restrictions "privilege".
I don't understand your concern.
My concern is that you're playing a shell game. Allow unprivileged
mounts, but only of things that were created using privilege. How
is that better than requiring privilege to do the mount?
Post by Amir Goldstein
I am saying that LSM can come to the rescue in a use case that
many have considered unsolvable (i.e. the loopback tampering).
Yes, I am trying to place restrictions on what an unprivileged user can do.
As it stands right now, the user is about to gain the ability to mount FUSE.
With some extra care in crafting the policy and without any extra code,
the user can gain the ability to mount "trusted loopback files".
It does not solve all use cases, but it does solve a handful.
As I said, you can do this, but it will be ugly, and people won't
understand how to use it correctly. The distance between the "trusted"
creation of the filesystem and the "untrusted" mount is too great.
Plus, there are too many ways to circumvent the integrity of your
"trusted" filesystem.
Post by Amir Goldstein
Anyway, the concern I was raising was about the fact that if files inside
the loopback mount inherit the label of the loopback file, this policy is
going to be impossible to write.
But Stephen has already proposed an alternative to this implicit inherit rule
on the [PATCH 6/7] thread, so I withdraw my concern.
What Stephen has proposed is dandy for SELinux.
Post by Amir Goldstein
Post by Casey Schaufler
Post by Amir Goldstein
Post by Seth Forshee
Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.
Post by Amir Goldstein
Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
Any thoughts on how to reconcile this conflict?
I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?
Seth