WEBVTT

00:00.000 --> 00:11.880
Right, okay, thank you, everyone, for staying till the end, as I said before, I know it's

00:11.880 --> 00:17.520
tempting to go get some beers on a Sunday afternoon, but you stuck it out, stuck with us.

00:17.520 --> 00:24.440
Ali is going to talk about writing an application kernel in Rust, so let's give

00:25.360 --> 00:28.560
the last of your Sunday energy to welcome him.

00:37.200 --> 00:39.600
Okay, this works, yeah?

00:39.600 --> 00:40.880
All right, hello, everyone.

00:40.880 --> 00:44.040
This is Syd: writing an application kernel in Rust.

00:44.040 --> 00:47.080
Again, thank you for staying until this late hour.

00:47.080 --> 00:51.080
I hope I'll make it as entertaining for you as possible, but it's a bit of a

00:51.080 --> 00:53.480
difficult subject, so bear with me.

00:53.640 --> 00:58.480
So, a little bit of an introduction about me, so maybe you'll trust me.

00:58.480 --> 01:04.200
I've been an Exherbo developer for the past 15 years, I used to be a Gentoo developer, and

01:04.200 --> 01:07.240
I'm the main author of Sydbox, our topic today.

01:07.240 --> 01:12.200
I'm also an international chess trainer, and the co-founder of Chess Without Boundaries,

01:12.200 --> 01:16.520
where we try to provide accessible materials to chess players with disabilities.

01:16.520 --> 01:21.240
And you can see my email here; if you want to contact me, feel free to.

01:22.200 --> 01:26.200
Here is the basic outline of what we are going to do today.

01:26.200 --> 01:28.840
First, we'll define what's an application kernel.

01:28.840 --> 01:33.560
It's a bit of a vague term, so making a definition first is logical.

01:33.560 --> 01:40.200
Then I'm going to explain to you how syscall interception actually works, so you'll get a

01:40.200 --> 01:45.080
broader understanding of how application kernels work, and then the main meat of the

01:45.080 --> 01:50.520
matter is of course Rust: why did I pick Rust, what are our memory safety patterns,

01:50.520 --> 01:55.720
and the trade-offs between safety and performance, and then I'll finish talking a bit about

01:55.720 --> 01:57.880
our testing infrastructure and Q&A.

01:59.240 --> 02:01.720
So what's an application kernel?

02:01.720 --> 02:05.960
Application kernel is a really vague term: it's both an application and a kernel,

02:05.960 --> 02:09.560
but neither an application nor a kernel, so it's somewhere in between.

02:09.560 --> 02:12.600
And here is a nice description I found from an article.

02:12.600 --> 02:17.560
It's a library operating system variant that intercepts, emulates, and transforms

02:17.560 --> 02:20.760
syscalls in user space for sandboxed processes.

02:20.760 --> 02:23.800
These three are important, I'll explain a bit more later,

02:23.800 --> 02:26.920
interception, emulation, and transformation, right?

02:26.920 --> 02:33.640
The way it does this is by intercepting system calls via seccomp unotify; unotify stands for

02:33.640 --> 02:40.520
user notification. It's a new API added to seccomp in recent Linux versions, and its main

02:40.520 --> 02:45.320
difference from ptrace is that you can handle system calls simultaneously from different

02:45.320 --> 02:51.000
threads, unlike ptrace, where you have to serialize. We also use ptrace and

02:51.000 --> 02:56.440
Landlock optionally. What it does is emulate file system, network, and process

02:56.440 --> 03:03.080
operations; it transforms paths, flags, and credentials at runtime, and you can configure it

03:03.080 --> 03:10.440
dynamically via a /dev/syd virtual path, which I'll delve into a bit deeper later. Similar projects

03:10.440 --> 03:17.640
are Google's gVisor, which is written in Go and is very similar. gVisor uses the seccomp trap API,

03:17.640 --> 03:24.440
and we use seccomp unotify, so the idea is that in Syd the system calls are handled in different

03:24.440 --> 03:29.240
processes, but in gVisor it's all in a single process, so depending on your use case,

03:29.240 --> 03:34.920
either one may be more secure. Then there are rump kernels of NetBSD fame, where you can develop your

03:34.920 --> 03:40.680
kernels as applications, so you're not scared that a crash can take down the whole system,

03:40.680 --> 03:46.200
and other examples are Nabla containers or spheres, and there are many other examples,

03:46.200 --> 03:53.960
but these are the outstanding ones. So what does Syd actually do? Here I tried to make it as

03:53.960 --> 04:00.360
simple as possible, but not simpler, so on the left you see the path that an open call

04:00.360 --> 04:08.360
takes until it's implemented in Syd. open, if you don't know, is a very basic Unix system call,

04:08.360 --> 04:15.560
which you use to open a file, and you get a file descriptor object from the kernel with which

04:15.560 --> 04:21.960
you can use read, write, and so on. When the sandbox process opens a file, the first thing

04:21.960 --> 04:28.600
that the Linux kernel does is send a seccomp notification to the Syd emulator thread that's

04:28.600 --> 04:33.480
handling it. There are many threads, as you will see; one of them will pick it up, and this

04:33.480 --> 04:40.520
notification has the system call number and the arguments, but the arguments are not directly usable; we have to first

04:40.520 --> 04:46.360
process them to make them useful, right? We get a pointer, not a string, right? The pointer

04:46.360 --> 04:54.360
tells us where the string is located in the sandbox process's memory, so we need another step: with

04:54.360 --> 05:00.840
process_vm_readv, we actually read this string into our process space. process_vm_readv is a bit more

05:00.840 --> 05:08.280
secure than the old way of reading /proc/pid/mem, because it respects the address space permissions

05:08.280 --> 05:14.360
of the sandbox process. Now we have a string, and this string is a path name from the perspective

05:14.360 --> 05:20.520
of the sandbox process, it can be a relative path, it can be an absolute path, so we need

05:20.520 --> 05:27.160
another step to actually turn it into a real path that we can run a sandbox check on, right?

05:27.160 --> 05:35.320
This step is the canonicalization, and canonicalization gives you two values back as return values:

05:35.320 --> 05:40.920
one is an O_PATH file descriptor, the other one is a canonical path, and the idea is that both of

05:40.920 --> 05:47.000
them point to the same thing. O_PATH, if you don't know, is a type of file descriptor that you can

05:47.000 --> 05:53.160
only use for path operations, not for reading or writing, so the actual open hasn't happened yet,

05:53.160 --> 05:59.480
right? Then the sandbox check happens. The sandbox check can have three outcomes: it can say

05:59.480 --> 06:05.320
the path must be hidden, in which case the sandbox process will get a "No such file or directory";

06:05.320 --> 06:10.520
it may be denied, in which case it will get an "Operation not permitted"; or it may be allowed, right?
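
NOTE
A minimal sketch of the three sandbox-check outcomes described in this cue (hidden, denied, allowed).
The names Verdict, check, and to_result and the prefix-based policy are hypothetical, not Syd's API.
Only the errno numbers are standard Linux values.
```rust
/// Outcome of a sandbox check for one path (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Deny,
    Hide,
}
// errno values as defined on Linux.
pub const EPERM: i32 = 1;
pub const ENOENT: i32 = 2;
/// Hypothetical toy policy: hidden prefixes take precedence over denied prefixes.
pub fn check(path: &str, hidden: &[&str], denied: &[&str]) -> Verdict {
    if hidden.iter().any(|p| path.starts_with(p)) {
        Verdict::Hide
    } else if denied.iter().any(|p| path.starts_with(p)) {
        Verdict::Deny
    } else {
        Verdict::Allow
    }
}
/// Translate a verdict into what the sandbox process observes: success or an errno.
pub fn to_result(v: Verdict) -> Result<(), i32> {
    match v {
        Verdict::Allow => Ok(()),
        Verdict::Deny => Err(EPERM),
        Verdict::Hide => Err(ENOENT),
    }
}
```
With hidden = ["/etc/shadow"] and denied = ["/etc"], opening /etc/shadow looks like a missing file, /etc/passwd is refused, and /tmp/x proceeds.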

06:10.520 --> 06:16.120
When it is allowed, the final stage happens, where we actually do the open system call via

06:16.120 --> 06:21.320
this eighth item you see; we do a procfs indirection to prevent time-of-check to time-of-use

06:21.320 --> 06:29.480
vectors here. Finally we have a file descriptor, and we use the seccomp ADDFD ioctl to add this file

06:29.480 --> 06:35.240
descriptor to the sandbox process. This is the big picture. On the right,

06:35.240 --> 06:41.000
you can see that many transformations can happen along this path: you can mask a path to change the path,

06:41.000 --> 06:46.920
it may be encrypted, in which case the encryption thread will take over; it may be append-only,

06:46.920 --> 06:51.640
in which case we will force the O_APPEND flag. As you can see, the transformations

06:51.640 --> 06:57.560
can happen safely, because it all happens in Syd's process. Finally, we can

06:57.560 --> 07:03.480
randomize the file descriptor to prevent file descriptor reuse attacks. So this is the basic idea;

07:03.480 --> 07:10.280
don't worry if you don't get it yet, we will get to it. So why Rust? This is the main

07:10.280 --> 07:15.960
meat of the matter and why we are all here, right? I started writing Sydbox 3,

07:15.960 --> 07:22.840
which became the Rust version, around three years ago, and my idea was to redesign it from scratch

07:22.840 --> 07:28.280
to make it a security boundary. Sydbox 1 was written in C, and it wasn't meant to be a

07:28.280 --> 07:35.480
security boundary. So instead of doing the conventional "rewrite it in Rust", I actually

07:35.480 --> 07:42.440
took the time to redesign it from scratch in Rust, and this way we could take advantage of many

07:42.440 --> 07:47.960
Rust goodies, right? And this is what I can recommend: don't just blindly rewrite things

07:47.960 --> 07:54.040
in Rust; redesign them from scratch with the powers of Rust, which makes them much better. Here are

07:54.040 --> 08:00.200
a few examples, which I will dive deeper into in a bit. Of course, memory safety is one of the

08:00.200 --> 08:05.480
prime features of Rust, right? In many modules, we have this forbid(unsafe_code) clause,

08:05.480 --> 08:11.640
so unsafe code is outright forbidden, and there are other goodies, when you are working with

08:11.640 --> 08:17.320
untrusted data, like in the ELF parser and the glob matcher. I'm going to delve into it in a bit.

08:17.320 --> 08:25.960
ELF is the executable file format of Unixes, and Syd parses ELF files to apply some restrictions,

08:25.960 --> 08:31.240
and it's completely untrusted data, right? So things like forbidding arithmetic side effects,

08:31.240 --> 08:39.400
truncation, or wrapping helps ensure that a malicious ELF cannot crash the sandbox, right?

08:39.400 --> 08:47.480
Moreover, we use Rust's type system to our advantage. The main use case is that we have a safe

08:47.480 --> 08:54.360
interface for Linux's mseal system call to seal a memory region so it's immutable, and I'll

08:54.440 --> 08:59.640
delve into it a bit later. We have this generic SealBox type that turns from unsealed into sealed

08:59.640 --> 09:05.480
when it's sealed. Ownership is, of course, another prime feature of Rust, and in Sydbox,

09:06.280 --> 09:12.760
we use this mostly for file descriptors, and this is fantastic, because file descriptor leaks are a

09:12.760 --> 09:18.680
huge problem in container security, and in Rust you can actually make your compiler work for

09:18.680 --> 09:27.160
you, and prevent these file descriptor leaks, right? And of course, the other two are things we all know,

09:27.160 --> 09:33.240
zero-cost abstractions and fearless concurrency. As I said, Syd handles system calls simultaneously,

09:33.240 --> 09:38.680
so Syd is a multi-threaded process that can have many emulator threads to handle system calls.

09:39.560 --> 09:45.640
So, let's dive a bit deeper into memory safety patterns. First, the typestate pattern:

09:45.640 --> 09:53.960
this SealBox, as I said, is an interface to Linux's mseal system call, and what we do is

09:53.960 --> 10:00.920
we mark the sandbox policy as immutable when it's locked, such that a compromised Syd cannot

10:00.920 --> 10:07.000
edit the sandbox anymore. As you can see on the slide, there is an enum Sealable,

10:07.000 --> 10:14.120
which is generic over the type T, and it has two variants, Unsealed and Sealed. In the default

10:14.120 --> 10:20.040
unsealed state, you can edit it as you wish, and then the one-way, idempotent seal function

10:20.040 --> 10:26.040
is called, and then it turns into sealed. After that you can only read it; you can no longer edit it.

10:26.040 --> 10:32.040
So, as you can see in the seal function, this is how we do it safely in Rust.
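
NOTE
A minimal sketch of the two-variant Sealable enum described in the cues above, assuming a plain in-memory value.
The method names (new, seal, get, get_mut, is_sealed) are hypothetical and not taken from Syd's source.
This sketch does not call mseal; a real implementation would also make the backing memory immutable in seal.
```rust
/// Two-state wrapper: editable while Unsealed, read-only once Sealed.
#[derive(Debug)]
pub enum Sealable<T> {
    Unsealed(T),
    Sealed(T),
}
impl<T> Sealable<T> {
    /// Values start out in the default, editable state.
    pub fn new(value: T) -> Self {
        Sealable::Unsealed(value)
    }
    /// One-way and idempotent: sealing an already sealed value changes nothing.
    pub fn seal(self) -> Self {
        match self {
            Sealable::Unsealed(v) | Sealable::Sealed(v) => Sealable::Sealed(v),
        }
    }
    /// Shared access is available in both states.
    pub fn get(&self) -> &T {
        match self {
            Sealable::Unsealed(v) | Sealable::Sealed(v) => v,
        }
    }
    /// Mutable access is handed out only while unsealed.
    pub fn get_mut(&mut self) -> Option<&mut T> {
        match self {
            Sealable::Unsealed(v) => Some(v),
            Sealable::Sealed(_) => None,
        }
    }
    pub fn is_sealed(&self) -> bool {
        matches!(self, Sealable::Sealed(_))
    }
}
```
Because seal consumes the value and there is no unseal, safe code cannot return a sealed policy to the editable state.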

10:32.920 --> 10:40.200
The ELF parser is another example. As I said, it works on completely untrusted data,

10:40.200 --> 10:48.200
so we have a handful of forbidden lints to forbid anything that can be a DoS attack vector,

10:49.240 --> 10:59.480
and I actually ran the ELF parser over a set of 68,000 malware samples from VirusShare, and it didn't crash.
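
NOTE
A minimal sketch of the checked-arithmetic style described above for parsing untrusted offsets and sizes.
The function table_entry and the ParseError type are hypothetical and do not reflect Syd's ELF parser.
Every arithmetic step either succeeds or returns an error; nothing wraps, truncates, or panics.
```rust
use std::convert::TryFrom;
/// Error returned instead of wrapping, truncating, or panicking.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Overflow,
    OutOfBounds,
}
/// Return the bytes of table entry `index`; `offset` and `entry_size` come from the untrusted file.
#[deny(clippy::arithmetic_side_effects)]
pub fn table_entry(data: &[u8], offset: u64, entry_size: u64, index: u64) -> Result<&[u8], ParseError> {
    // start = offset + index * entry_size, with every step checked.
    let start = index
        .checked_mul(entry_size)
        .and_then(|rel| rel.checked_add(offset))
        .ok_or(ParseError::Overflow)?;
    let end = start.checked_add(entry_size).ok_or(ParseError::Overflow)?;
    // Convert to usize without silent truncation on 32-bit targets.
    let start = usize::try_from(start).map_err(|_| ParseError::Overflow)?;
    let end = usize::try_from(end).map_err(|_| ParseError::Overflow)?;
    // Slice::get returns None rather than panicking when the range is out of bounds.
    data.get(start..end).ok_or(ParseError::OutOfBounds)
}
```
A header that claims a huge offset or entry size then yields Err(ParseError::Overflow) or Err(ParseError::OutOfBounds) rather than a crash.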

10:59.480 --> 11:05.880
So, I'm fairly confident it does the right thing. And let's talk a bit about safety and performance

11:06.040 --> 11:13.240
and what are the trade-offs. In my experience, performance is no excuse to use unsafe code,

11:13.240 --> 11:21.160
which is not really a common view among Rust people, as far as I can tell. And here is a very good example:

11:21.160 --> 11:27.560
globbing, if you don't know, means file name matching. If you have ever written shell and used the

11:27.560 --> 11:32.840
characters star or question mark, this is what you're using, it's a bit similar to regular

11:32.840 --> 11:39.480
expressions, but not quite. The original glob matcher of Syd was inherited from rsync,

11:39.480 --> 11:46.200
and it was written exactly 40 years ago, in 1986. As a 40th birthday present to rsync,

11:46.200 --> 11:51.720
I rewrote this algorithm using Kirk Krauss's FastWildCompare algorithm, which is known to be

11:51.720 --> 11:58.600
the fastest out there. And here is a nice example of the benchmarks of two million test cases I generated.

11:58.600 --> 12:05.240
The wildmatch code has no unsafe at all, and it performs almost two and a half times faster

12:05.240 --> 12:11.640
than libc's fnmatch, which is C and ought to be fast, right? But this is not the case.
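
NOTE
A minimal sketch of iterative wildcard matching without unsafe code, in the spirit of the approach mentioned above.
It is not Syd's wildmatch and not a port of any published algorithm; wild_match is a hypothetical name.
It handles only `*` and `?` on bytes and ignores path separators, character classes, and escapes.
```rust
/// Match `text` against `pattern`: `*` matches any run of bytes, `?` matches exactly one byte.
pub fn wild_match(pattern: &[u8], text: &[u8]) -> bool {
    let (mut p, mut t) = (0usize, 0usize);
    // Pattern index after the most recent `*`, and the text index that star currently extends to.
    let mut backtrack: Option<(usize, usize)> = None;
    while t < text.len() {
        if p < pattern.len() && pattern[p] == b'*' {
            // Remember the star and first try matching it against nothing.
            backtrack = Some((p + 1, t));
            p += 1;
        } else if p < pattern.len() && (pattern[p] == b'?' || pattern[p] == text[t]) {
            p += 1;
            t += 1;
        } else if let Some((star_p, star_t)) = backtrack {
            // Mismatch: let the last star swallow one more byte and retry from there.
            backtrack = Some((star_p, star_t + 1));
            p = star_p;
            t = star_t + 1;
        } else {
            return false;
        }
    }
    // Text is consumed: only trailing stars may remain in the pattern.
    pattern[p..].iter().all(|&c| c == b'*')
}
```
Keeping a single backtrack point instead of recursing means the match needs no recursion and no heap allocation.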

12:11.640 --> 12:20.520
Another thing we use to reduce small allocations is custom path types. Syd mostly works

12:20.520 --> 12:29.400
on small strings, right? And this tiny vector module allows us to store the small strings on the stack.

12:29.400 --> 12:36.040
Only when it overflows will it be allocated on the heap. So this cuts a lot of the small allocations

12:36.040 --> 12:42.600
that Syd does. And this path type is the corresponding dynamically sized type, which is pretty much similar

12:42.600 --> 12:49.240
to the standard library's Path, but it does comparisons with SIMD, so on recent CPUs it's much faster.

12:49.240 --> 12:58.840
So, our testing infrastructure: Sydbox is a portable sandbox. It only runs on Linux, but it runs on

12:58.840 --> 13:04.920
most architectures that libseccomp supports. Some of them are here. And we have a multi-architecture

13:04.920 --> 13:11.640
pipeline that tries to test all of them. Syd is also a multi-personality sandbox,

13:11.640 --> 13:17.560
which means you can trace a 32-bit process from a 64-bit Syd just fine. And again, we have

13:17.720 --> 13:23.160
cross-compiled tests for that. Another nice benefit is that when you're writing a kernel, everyone's

13:23.160 --> 13:29.400
test is your test, right? So on Exherbo Linux, we have package testing on by default, and everything runs

13:29.400 --> 13:36.360
under the sandbox. And if a test fails under the sandbox, but passes without, it's a sandbox bug.

13:36.360 --> 13:44.440
We also run the Linux Test Project's syscall test suite, which has over 4,000 tests,

13:44.440 --> 13:50.920
and gnulib's POSIX compatibility tests. So we are fairly certain Syd does what Linux would do.

13:50.920 --> 13:57.880
Right? The idea is to just be a thin layer. And another thing is, of course, security:

13:57.880 --> 14:05.080
this is a security boundary. For every sandbox escape we found in the past, we have an

14:05.080 --> 14:14.360
integration test that makes sure it doesn't reappear. Yeah. So this is pretty much all

14:14.360 --> 14:21.320
I have. Here is our GitLab. The code is GPL-3 and forever free, so feel free to do whatever you

14:21.320 --> 14:27.160
want with it. We have extensive documentation in the form of manual pages. And if you have any questions,

14:27.160 --> 14:31.880
you could not ask here, come over to IRC or Matrix and ask. And finally, thanks to Fender,

14:31.880 --> 14:36.920
one more data for sponsoring my attendance. That's all I have. I can take questions now.

14:44.360 --> 15:04.680
Thank you very much. Sorry if I missed the point, but I wanted to ask, is there already some

15:04.680 --> 15:12.280
kind of tooling to define rules for running an application under this, like what the application

15:12.280 --> 15:17.960
can access, what should be forbidden or hidden from it? Yes, yes, yes, exactly.

15:17.960 --> 15:24.040
Sydbox works with text-based policies, and there are over 30 categories of access, right?

15:24.040 --> 15:29.560
So you can say allow read on this path, or allow write; read, write, exec, and so on

15:29.560 --> 15:34.520
are all categories. You can configure them in a text-based policy, and you can load this

15:34.520 --> 15:40.680
at startup. You can also configure it dynamically at runtime; both of them are possible.

15:42.280 --> 15:50.200
Thank you for the question. I have a small one. Okay. You are showing two Clippy lints.

15:50.200 --> 15:55.880
I'm sorry? You are showing two Clippy lints that I'm not very familiar with. What are those?

15:55.880 --> 16:01.640
And what do they prevent? This one, yeah? Yes, that one.

16:02.680 --> 16:06.760
forbid(unsafe_code): that one we all know, right?

16:06.840 --> 16:13.960
arithmetic_side_effects is for when you multiply two numbers, for example, and the type

16:13.960 --> 16:21.080
overflows. What's going to happen, right? Or you are trying to multiply two numbers that won't fit

16:21.080 --> 16:27.480
into a type. All of this is arithmetic that can have side effects this way, right? Wrapping or

16:27.480 --> 16:34.200
overflowing, you know, all that. And in the ELF parser, imagine, the ELF is completely

16:34.200 --> 16:39.640
untrusted. The size can be wrong. Everything can be wrong. You have to work with this untrusted

16:39.640 --> 16:46.280
data, and this helps with all that, right? To not overrun the boundaries; I can explain it

16:46.280 --> 16:52.760
like that. And there are many more forbid clauses. Feel free to take a look at the code.

16:52.760 --> 16:59.560
I have comments there. Thank you for the question. Have one there?

17:00.440 --> 17:07.880
Just one small one. Why did you need that protection for arithmetic operations?

17:07.880 --> 17:12.440
I'm sorry? Why did you need that protection for arithmetic operations? Does the library not

17:12.440 --> 17:19.320
provide it out of the box? I don't understand the question. Yes, in a slide before.

17:19.320 --> 17:23.560
The DoS protection, you mean, right? Yeah. Yeah, that's fine.

17:23.960 --> 17:29.880
Right, I'm sorry. Yeah, we need to specify them. This basically means that

17:29.880 --> 17:36.040
during parsing the ELF is untrusted, right? And any time an overflow happens, it will return

17:36.040 --> 17:40.600
an error. It will not overflow or do something undefined with it or things like this.

17:40.600 --> 17:45.000
So you are preventing all the undefined behavior. And instead, you are returning an error.

17:45.000 --> 17:50.280
It's as simple as that. Yeah. And why Pink Floyd? I'm sorry?

17:50.280 --> 17:56.120
And why Pink Floyd? Pink Floyd, yeah? I mean, it's Sydbox. It's Syd Barrett,

17:56.120 --> 18:01.960
so Pink Floyd. Pink Floyd is a must. Thank you. Thank you for the question.

18:06.440 --> 18:09.800
Any other questions? You're good, yeah.

18:13.320 --> 18:15.800
All right, so thank you very much. Thank you, everyone.

