WEBVTT

00:00.000 --> 00:17.200
Hi folks. To clear up some confusion there may have been: zero overhead is better than zero copy, so it's

00:17.200 --> 00:19.720
zero copy plus plus.

00:19.720 --> 00:29.880
I'm giving this talk under the SDS track, because the first use cases I have

00:29.880 --> 00:38.560
for it are for the Ceph file system, and for any software-defined file system as well,

00:38.560 --> 00:39.560
right?

00:39.560 --> 00:44.920
But it's a generic infrastructure for zero overhead, that's my goal.

00:45.880 --> 00:51.400
We'll look at the motivation, why copying is bad, I hope everyone knows why, but, you know,

00:51.400 --> 01:00.040
I'll say a few words, then the proposed solution, and we'll go through a demo that I have, it's on

01:00.040 --> 01:07.520
GitHub, I'm not sure it compiles, so, good luck, it will improve in due course.

01:07.560 --> 01:14.560
And I'll be talking about the usage, right, or how I'm planning to use it.

01:14.560 --> 01:25.040
So, data movement. We're going to focus on a very specific location: the

01:25.040 --> 01:27.360
boundary between the kernel and the user space, right?

01:27.360 --> 01:34.960
So when you receive data or send data, you cross the kernel/user-space boundary, and by

01:34.960 --> 01:43.680
moving bytes, your cache gets polluted, your cycles are getting wasted, and it's,

01:43.680 --> 01:46.000
it's bad for your performance.

01:46.000 --> 01:49.680
So copying is bad, that's not news.

01:49.680 --> 01:57.120
Usually what we have, as I mentioned, you will have this neat network interface card,

01:57.120 --> 02:01.560
you know, it actually does move data, it does touch some of the bytes, but we're talking

02:01.640 --> 02:03.480
about the CPU, right?

02:03.480 --> 02:08.520
So you'll have the kernel buffer, and you'll have one copy into user space, and, for example,

02:08.520 --> 02:14.920
for the focus of this talk today, we're talking about proxy systems.

02:14.920 --> 02:23.480
So for something like Ganesha, or, you know, something simpler, like,

02:23.480 --> 02:29.960
imagine you wrote something, or a CDN, that's a better example, where you have data that you

02:29.960 --> 02:33.560
want to cache on your side, and then forward, right?

02:33.560 --> 02:39.560
So when you are not actually interested in the data itself,

02:39.560 --> 02:46.440
but rather you're a proxy system, you have data coming in, and you have to send it out.

02:46.440 --> 02:49.800
Once or twice, or however many times you'd like,

02:49.800 --> 02:53.880
but none of the times you're actually interested in reading the information itself.

02:53.880 --> 02:59.000
So if you're not interested in the data, let's see, if we add this capability, what we can get,

02:59.080 --> 03:00.680
and we can get a lot, right?

03:00.680 --> 03:04.360
So usually, you'll have this buffer, you copy into the user,

03:04.360 --> 03:10.680
you cache it if it's a CDN, or some kind of cache, and then you copy it back each and every time

03:10.680 --> 03:13.480
when you have to send it, right?
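The copy cost in that flow can be made concrete with a small sketch (illustrative only; the class and names here are invented, not the talk's code): every receive crosses the boundary once, and every send of a cached object crosses it again.

```python
# Hypothetical sketch: counting the bytes a traditional proxy copies.
class CopyCountingProxy:
    def __init__(self):
        self.copied_bytes = 0
        self.cache = {}

    def recv(self, key, kernel_buf):
        # Kernel -> user crossing: the payload is copied once on receive.
        user_buf = bytes(kernel_buf)          # copy on receive
        self.copied_bytes += len(kernel_buf)
        self.cache[key] = user_buf

    def send(self, key):
        # User -> kernel crossing: every send copies the payload again.
        out = bytes(self.cache[key])          # copy on every send
        self.copied_bytes += len(out)
        return out

proxy = CopyCountingProxy()
proxy.recv("obj", b"x" * 1000)
for _ in range(3):                            # serve the same object 3 times
    proxy.send("obj")
print(proxy.copied_bytes)                     # 1000 in + 3 * 1000 out = 4000
```

Serving one cached kilobyte three times already moves four kilobytes through the CPU, which is exactly the waste the talk is targeting.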

03:13.480 --> 03:15.640
So how many CPU cycles?

03:18.040 --> 03:21.880
A lot, and it depends a lot on the speed of your network card.

03:21.880 --> 03:29.720
So the working set size of your memory really matters.

03:29.720 --> 03:36.440
So it's a number that makes sense, but it really depends on a lot of different factors,

03:36.440 --> 03:38.840
on how much memory you're using.

03:38.840 --> 03:42.280
When you're not copying data, you're using much less memory,

03:42.280 --> 03:48.920
or fewer bytes, so this will improve your performance.

03:48.920 --> 03:52.280
A lot, again, depending on your exact use case.

03:52.280 --> 03:58.360
So what are the overhead elements, and why am I calling it zero overhead,

03:58.360 --> 03:59.960
you know, and not zero copy?

03:59.960 --> 04:03.880
We usually have copies, but you do have MSG_ZEROCOPY,

04:03.880 --> 04:07.960
and other zero-copy mechanisms inside the kernel today, already.

04:07.960 --> 04:15.240
They all have their own problems, right?

04:15.240 --> 04:23.880
They're inefficient in some ways, they're trying to remap your page tables on the fly,

04:23.880 --> 04:25.720
and it hurts performance, right?

04:25.720 --> 04:32.280
So with zero overhead, it means your information moves from the NIC to

04:32.280 --> 04:38.520
the NIC and back, basically, without any overhead, just the control plane moves,

04:38.520 --> 04:44.600
and your data plane doesn't move any bytes, and it doesn't manipulate any metadata

04:44.600 --> 04:47.960
that you would need in order to avoid moving bytes.

04:47.960 --> 04:50.280
It's just there, right?

04:50.280 --> 04:56.840
So now that you're aware of the magical capabilities of our solution,

04:56.840 --> 05:00.760
let's kind of see what it actually does.

05:00.760 --> 05:03.960
Okay, actually, repeating myself here.

05:03.960 --> 05:12.440
So the idea is that because you're not moving data,

05:12.440 --> 05:16.600
you actually keep the data inside the kernel, right?

05:16.600 --> 05:20.840
It stays inside the kernel, inside the kernel buffers.

05:20.840 --> 05:27.720
You, as the proxy system, only get a handle, and an offset and size,

05:27.720 --> 05:30.520
for what you want sent, right?

05:30.520 --> 05:34.120
So you only have buffer IDs, socket IDs,

05:34.120 --> 05:36.440
offsets and lengths, as I mentioned.

05:36.440 --> 05:40.440
There are no pointers to the data, just handles that you can use,

05:40.520 --> 05:43.640
you know, to address the data and say, here, I'm sending it,

05:43.640 --> 05:49.880
it's a handle, not a pointer, right, because you can't actually access it.
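A minimal sketch of such a handle, with invented field names (the talk doesn't show the actual structure): user space holds only identifiers, and there is deliberately no way to dereference the payload, which never leaves the kernel.

```python
# Hypothetical handle shape: metadata only, no data pointer.
from dataclasses import dataclass

@dataclass(frozen=True)
class BufferHandle:
    buffer_id: int   # slot in the kernel driver's buffer pool
    socket_id: int   # kernel socket the data arrived on
    offset: int      # where inside the kernel buffer the payload starts
    length: int      # how many bytes the handle covers

    def data(self):
        # No dereference is possible: the bytes live in kernel memory only.
        raise PermissionError("handles cannot be dereferenced from user space")

h = BufferHandle(buffer_id=7, socket_id=3, offset=0, length=4096)
print(h.length)  # 4096 -- the metadata is visible, the payload is not
```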

05:49.880 --> 05:56.200
And all the actual IO happens exclusively inside the kernel, right?

05:56.200 --> 06:04.520
From user space, in this example, we're using io_uring to communicate with our

06:04.600 --> 06:05.800
kernel driver that does it.

06:05.800 --> 06:13.480
I will give an example of a specific use case a bit later, right?

06:15.080 --> 06:18.840
So what does the user do, right?

06:18.840 --> 06:27.320
So we allocate buffer handles; this is the solution to a problem that we have today.

06:27.320 --> 06:34.360
Mainly, we kind of make sure that there is a slot in our kernel

06:35.320 --> 06:38.920
driver space for the allocated buffers.

06:38.920 --> 06:43.080
It's just to make sure that we can manage our memory in a good way.

06:43.080 --> 06:46.680
We can get back pressure in the kernel sockets and things don't explode.

06:47.400 --> 06:51.880
It's not a necessity, it's the solution today, right?

06:53.720 --> 06:59.560
Anyway, we still, you know, pack it and get a way of

06:59.880 --> 07:04.760
addressing your bytes in a way that the kernel understands, right?

07:06.200 --> 07:08.600
But today we pre-allocate the buffers and we know

07:09.720 --> 07:13.640
which, well, the buffers are actually descriptors, right? We pre-allocate descriptors.

07:14.280 --> 07:20.280
And we have them ready and we know that the TCP sockets can handle it and we can manage them.

07:20.840 --> 07:22.760
I'll give an example, it's a bit abstract.
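A minimal sketch of that pre-allocated descriptor pool idea (an assumed design to illustrate the point, not the driver's actual code): a fixed number of slots is reserved up front, and allocation failure is the back-pressure signal, so nothing is allocated on the fly and nothing explodes.

```python
# Hypothetical descriptor pool with built-in back pressure.
class DescriptorPool:
    def __init__(self, slots):
        self.free = list(range(slots))   # all descriptor IDs pre-allocated

    def alloc(self):
        # No on-the-fly allocation: an empty free list means the caller
        # must back off until a descriptor is released.
        return self.free.pop() if self.free else None

    def release(self, desc_id):
        self.free.append(desc_id)

pool = DescriptorPool(slots=2)
a, b = pool.alloc(), pool.alloc()
print(pool.alloc())   # None -> back pressure, the caller waits
pool.release(a)
print(pool.alloc())   # a descriptor is available again
```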

07:24.040 --> 07:29.480
We create and manage sockets, but again, the sockets we are using are kernel sockets.

07:30.520 --> 07:35.880
So we have a kernel driver that creates kernel sockets, and what you have in user space

07:36.840 --> 07:45.080
is an abstraction. We have this library that provides you a socket-like API.

07:45.640 --> 07:50.520
But the sockets themselves that you are using, you know, TCP, UDP, whatever you like,

07:50.520 --> 07:59.000
are inside the kernel, right? So we get receive completion notifications, and you can request

07:59.080 --> 08:01.880
to get a peek inside the data that you're receiving, right?

08:02.680 --> 08:07.320
Because when you receive data, you may want to look at the headers, for example, right?

08:07.320 --> 08:11.320
And if it's some kind of TLV information, right? This is the type you got.

08:11.960 --> 08:17.560
This is the size of it and you know, the offset of whatever you want to do, right?

08:17.560 --> 08:23.240
You don't actually need to see the whole data, you might need to see parts of it, like, small parts of it.

08:24.040 --> 08:28.680
But you can't access the data in any other way, right?
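The peek idea above can be sketched like this (assumed semantics and an invented 4-byte TLV header, just to illustrate): only a small header prefix is copied out so the proxy can parse the framing; the payload itself stays where it is.

```python
# Hypothetical peek: copy only the header bytes needed to parse TLV framing.
import struct

# Invented framing: 2-byte type, 2-byte length, then the payload.
KERNEL_BUFFER = struct.pack("!HH", 0x0101, 1000) + b"p" * 1000

def peek(buf, nbytes):
    # Copies just `nbytes` -- the only bytes the CPU ever touches.
    return bytes(buf[:nbytes])

header = peek(KERNEL_BUFFER, 4)
msg_type, msg_len = struct.unpack("!HH", header)
print(msg_type, msg_len)   # 257 1000 -- enough to decide what to forward
```

Four bytes copied instead of a kilobyte: the proxy learns everything it needs for routing without ever reading the payload.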

08:30.440 --> 08:36.680
This is the kernel architecture, basically, as I've described: you have this character device,

08:37.480 --> 08:46.280
like fuse has, and you have this buffer pool inside the kernel.

08:46.280 --> 08:51.720
You have the socket manager, and you have kind of a zero-copy IO engine, it's kind of an abstraction.

08:52.680 --> 09:02.680
There are changes to be made inside the kernel sockets, actually, uh, okay.

09:02.680 --> 09:07.880
We'll discuss this a bit later, but that's the architecture.

09:07.880 --> 09:16.760
You have the kernel sockets, you have the buffers, and you have the application; one of the potential use cases is fuse.

09:17.480 --> 09:23.400
Again, we'll discuss it a bit later. We have the handles, you know, you can create them,

09:24.120 --> 09:31.880
and we have the actual binaries, the library for interaction with our kernel device.

09:33.640 --> 09:40.440
So, data flow. On the receive side, we have the kernel

09:40.520 --> 09:46.520
recvmsg; the problem that we have in today's implementation is

09:46.520 --> 09:53.160
that the kernel API copies the bytes inside the kernel as well, right, it has the same copy semantics.

09:53.800 --> 09:59.720
But it's not a given, right, there are APIs that can be used, and you can get the iovec, you know,

09:59.720 --> 10:03.720
from the actual buffers that we receive from the NIC itself, right.

10:03.720 --> 10:09.880
So it's an implementation limitation today, because my focus was to get the infrastructure going

10:09.880 --> 10:15.400
and then to actually make sure that, you know, the bits and bytes actually do not get copied, right.

10:16.040 --> 10:20.440
So you have the kernel recvmsg, and it goes into our data buffer pool,

10:20.440 --> 10:29.080
that's kind of the pre-allocation that we have today already, right. So, in the correct implementation,

10:29.080 --> 10:33.960
we have zero copies, right, and you have this buffer ID, which today

10:34.040 --> 10:42.760
actually points to an actual buffer with an actual size, but then it will point to an iovec of received pages,

10:42.760 --> 10:50.920
right, again, with their sizes, but it's not a single buffer, it's an iovec, as we received from the

10:51.720 --> 11:01.400
network. That's the forward path: you receive the buffer,

11:01.480 --> 11:05.480
you have this ID, and then you send it, right, but again, today you're going to send an actual

11:05.480 --> 11:14.440
buffer, so you send it to the device, and it's sent via the, uh, kernel socket. And again,

11:15.480 --> 11:27.880
true zero copy still needs to be handled, okay. Why does user space initiate the allocations? A couple

11:27.880 --> 11:34.280
of excuses, mainly that I didn't get to fixing it properly, but you still need back pressure on your

11:34.280 --> 11:40.440
TCP sockets, right. When you have two sockets and you're combining them, one of the projects

11:40.440 --> 11:44.440
that colleagues and I were involved in a couple of years ago was just connecting two TCP sockets,

11:45.160 --> 11:50.760
and you will get an improvement in the TCP performance, and the

11:50.840 --> 11:59.400
magic of it, just look up KTCP, it's on YouTube. Uh, the last point means that

12:01.400 --> 12:09.720
there are no subtle bugs that you can get into, uh, and no hot buffer allocations,

12:09.720 --> 12:13.880
right, so you pre-allocate the memory, you don't have to allocate anything on the fly,

12:13.880 --> 12:19.240
and the IDs are simply, like, one, two, three, four, five, instead of handling, uh, kernel addresses and

12:19.320 --> 12:24.840
saying, hey, this is the kernel address, now we need to remap the pages, and kind of the opposite;

12:24.840 --> 12:30.440
it's not very complicated, but for the initial implementation, the sum of these

12:30.440 --> 12:40.440
problems, uh, kind of resulted in our module pre-allocating, uh, the memory.

12:41.480 --> 12:47.400
So, the components today: we have the buffer allocation, we have the socket management, we have the

12:47.480 --> 12:53.880
io_uring interface to interface between your application and the kernel sockets,

12:55.080 --> 13:03.000
and true zero copy is in the works. You have this user space library where you can create a context,

13:03.000 --> 13:09.640
destroy a context, allocate buffers, uh, free them, uh, socket, listen, uh, you know,

13:09.720 --> 13:20.520
send, receive, basically a socket API again. And here is kind of a chunk of, uh, actual

13:20.520 --> 13:27.240
code, you can get it, uh, on GitHub. You receive a buffer, and then,

13:28.280 --> 13:34.120
it's not a very nice chunk of code, but here, you receive bytes, and now

13:34.680 --> 13:41.080
you send them with the context that you created, and this is the out socket, and this is the buffer,

13:41.080 --> 13:49.560
right, it's a handle, an offset, zero if you want to send it all, a length, and some type, right,

13:50.360 --> 13:56.360
and here, you just flush it, it's an io_uring, uh, application, and you just sent,

13:57.320 --> 14:06.760
not the bytes, right, you just sent the metadata, that's all. Uh, performance expectations:
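The send-by-handle call just described can be sketched roughly like this (all names are invented for illustration; the real library on GitHub is C and talks to the kernel driver through io_uring, which is only simulated here): what gets queued and flushed is metadata only.

```python
# Hypothetical context: send() queues metadata, flush() submits the ring.
class Context:
    def __init__(self):
        self.submission_queue = []

    def send(self, socket_id, handle, offset, length, msg_type):
        # Queue a submission entry, io_uring-style: handle + offset +
        # length + type. No payload bytes are touched in user space.
        self.submission_queue.append(
            {"sock": socket_id, "handle": handle, "off": offset,
             "len": length, "type": msg_type})

    def flush(self):
        # In the real design this submits the ring to the kernel driver;
        # here we just report how many entries were queued.
        n = len(self.submission_queue)
        self.submission_queue.clear()
        return n

ctx = Context()
ctx.send(socket_id=5, handle=7, offset=0, length=4096, msg_type=1)
print(ctx.flush())   # 1 entry submitted, zero payload bytes moved
```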

14:06.760 --> 14:10.680
we're talking about, probably, from previous experience, working with

14:11.640 --> 14:18.440
40 gigabit, 100 gigabit links; again, if you're working with 10 gigabit, uh,

14:19.640 --> 14:24.840
you're probably okay with copying as well, but again, it varies, depending on

14:25.720 --> 14:38.120
your exact use case. Uh, so with the traditional approach, you have two copies, uh, right,

14:38.760 --> 14:45.240
you copy a lot of bytes, and you have some latency, again, because you're, you're

14:45.320 --> 14:52.760
disrupting the cache and, uh, wasting cycles on copying, you're not efficient, right.

14:52.760 --> 14:59.560
With MSG_ZEROCOPY, you don't have actual copying, but you are there inside very critical

14:59.560 --> 15:05.560
paths of your, uh, process, managing your cache levels, and it's inefficient.

15:06.440 --> 15:14.280
Zero overhead means that you have none of it, right, you're just receiving handles and sending

15:15.160 --> 15:25.080
them away. So, why fuse, why are we having this discussion here,

15:26.200 --> 15:30.760
it's because one of the use cases that we want to address is the fuse client, right,

15:31.800 --> 15:40.120
so fuse is, I don't know how many are familiar with it, basically, a user space interface

15:40.200 --> 15:48.520
for your abstract, uh, virtual file system, right. So you have the fuse kernel driver,

15:49.400 --> 15:56.680
and all the callbacks for regular file system operations are delegated back to the

15:56.680 --> 16:03.720
user space caller, right. So, uh, we have the user space client, and we want to

16:04.680 --> 16:13.160
receive the bytes, uh, from the network, keep them inside the kernel, and then push them back into the

16:13.160 --> 16:20.040
fuse driver and back to the user. So, today, a fuse client, uh, will receive the bytes,

16:20.040 --> 16:26.280
copy them into user space, then copy them back into fuse, and then maybe copy them back, I'm not sure

16:26.360 --> 16:34.680
about it, uh, to the user again, right. So, it's three copies. With our solution, uh, the initial

16:34.680 --> 16:43.080
copy is gone, obviously; for the path between the, uh, network and fuse, there are a couple of ways

16:43.080 --> 16:49.320
to go about it, uh, I can't decide between splice and, and maybe eBPF, it can be optimized.

16:50.280 --> 16:57.400
But once it's done, basically, we'll have data arriving, the logic of the fuse client remaining in

16:57.400 --> 17:03.400
user space, and the actual bytes moving, via one of the available mechanisms,

17:04.200 --> 17:12.440
into the fuse kernel driver, and then to the user, right. So, it's both, uh, send and receive,

17:12.520 --> 17:19.720
just, you know, the other way around. So, basically, aside from fuse providing the bytes to the

17:19.720 --> 17:29.720
user, there are no copies. Um, okay, skip this. Uh, so, you have splice, I mean, maybe

17:29.720 --> 17:37.960
eBPF, I don't know, I'm still, uh, thinking about it. Implementation status: as I mentioned,

17:38.040 --> 17:43.640
most of the things are done, but I need to do true zero copy, then we'll talk about fuse integration,

17:45.320 --> 17:50.040
some performance benchmarks, our demo support, you know, just vibing.
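The fuse copy accounting argued earlier can be written down as a rough sketch (this reproduces the talk's reasoning, not measured data, and the three-crossing breakdown is the speaker's estimate): today's fuse client copies the payload at each boundary, while with kernel-resident buffers only the final fuse-to-user copy remains.

```python
# Rough copy accounting for the fuse receive path (illustrative only).
def copies_today(nbytes):
    crossings = ["net->user (recv)", "user->fuse (write)", "fuse->user (read)"]
    return len(crossings) * nbytes      # three crossings, three copies

def copies_with_handles(nbytes):
    crossings = ["fuse->user (read)"]   # bytes stay in the kernel until here
    return len(crossings) * nbytes

print(copies_today(4096), copies_with_handles(4096))  # 12288 4096
```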

17:53.400 --> 18:00.600
What else do we need? So, uh, library integration; a standalone demo is available right now.

18:01.320 --> 18:05.800
Again, you'd need to check if what's, you know, up there is compiling.

18:08.920 --> 18:14.600
Then we will need to integrate with fuse, uh, and use it for, you know,

18:14.600 --> 18:22.520
such large operations, right. What else can we do? We have additional things, um,

18:22.520 --> 18:28.200
like, for Ceph specifically, Ganesha, right. Ganesha is a user space process

18:28.840 --> 18:34.840
that, on one hand, is a Ceph client, and on the other is an NFS exporter, right.

18:35.480 --> 18:40.120
But it doesn't actually need to touch any of the bytes that it's, uh, you know, servicing.

18:41.400 --> 18:48.040
So having Ganesha use this kind of solution is ideal. It will just collect identifiers from the

18:48.040 --> 18:55.240
kernel, keep the buffers wherever they may be, depending on the caching, uh, algorithm or

18:55.240 --> 19:01.240
caching policy that Ganesha may have at hand at the moment, and then serve it as many times as

19:01.320 --> 19:06.760
it needs to, right. So Ganesha is the ideal client, but it's kind of, you know,

19:07.320 --> 19:14.840
down the road, because there are, you know, many complexities. There are also non-file

19:14.840 --> 19:21.400
system use cases, uh, CDNs, right. So when you have a file stream, like a movie stream, right,

19:21.400 --> 19:27.240
you just receive it, and then whenever a client that's close to you needs to see whatever,

19:27.560 --> 19:35.080
you know, whatever show you want to see, you know, it's served from the kernel.

19:37.160 --> 19:45.320
Uh, something like memcached, uh, or Redis, again, for the same, uh, use case. You receive whatever

19:46.120 --> 19:52.200
you want to serve, but you keep it inside the kernel memory, you keep the handle, and all the

19:52.280 --> 20:02.520
logic that you had remains unmodified. And that's kind of the, uh, general proxy pattern.

20:04.120 --> 20:10.280
Uh, so let's wrap it up. The key takeaways: you have zero overhead, it's better than zero

20:10.280 --> 20:18.520
copy, right. None of the usual costs that we have today with zero copy solutions are here.

20:18.520 --> 20:26.360
So it's zero, right. Zero overhead, and it's a handle-based architecture. The reason why we

20:26.360 --> 20:34.120
chose this: it's very unintrusive, right. So, uh, it's a kernel module that can be built out of

20:34.120 --> 20:41.960
tree, right. It can be just our own implementation. So I don't really need to worry

20:42.040 --> 20:50.920
about upstreaming things, though, you know, it's in the plan. Uh, proxy workloads benefit,

20:50.920 --> 20:58.120
like everything that I mentioned. And so any feedback, because again,

20:58.120 --> 21:06.280
this is work in progress, is welcome. Uh, the 50 to 70 percent prediction is: if you're going

21:06.360 --> 21:15.320
up to a 100 gigabit NIC, yeah, you'll be spending about 70% of your, uh, CPU

21:15.320 --> 21:22.280
just moving bytes, if you go to, you know, 100 gigabit links and beyond. So, you know,

21:22.280 --> 21:26.840
going forward. And again, 100 gigabit links have been here for, you know,

21:26.920 --> 21:36.200
about 10 years now. So, uh, questions. Thank you.
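The back-of-envelope arithmetic behind that CPU estimate looks roughly like this (illustrative numbers only, not a benchmark; the factor of two per copy for a read plus a write is an assumption):

```python
# Rough arithmetic: memory traffic implied by copying at 100 Gbit/s.
line_rate_gbit = 100
wire_gb_per_s = line_rate_gbit / 8          # 12.5 GB/s of payload on the wire
copies = 2                                  # recv copy + send copy in a proxy
mem_traffic = wire_gb_per_s * copies * 2    # each copy is a read plus a write
print(wire_gb_per_s, mem_traffic)           # 12.5 50.0
```

Tens of GB/s of memory traffic just for copies is a large slice of a typical server's memory bandwidth, which is why the CPU share quoted in the talk is plausible at these link speeds.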

21:40.200 --> 21:44.920
First of all, um, if you put your repo out there, it will cause somebody to assemble

21:44.920 --> 21:51.880
one, um, in the second question, I have, uh, your age and age also, so you don't have to set

21:51.960 --> 21:58.040
underscore in there. That's true. I mean, how is this Ceph specific? Okay. So, uh, it's a good

21:58.040 --> 22:02.760
question. So it's a double question. Yes, it's both, Ganesha and such, but it's just an example.

22:02.760 --> 22:09.000
You can use anything else as well. Why does it have Ceph in the name? Because I'm bad with names, and we can

22:09.000 --> 22:16.120
change it. There's no, no Ceph in there, uh, yeah. What's the difference

22:16.680 --> 22:24.040
between this and registered buffers from io_uring? Uh, yes.

22:27.400 --> 22:32.840
For all that, Scott? I also loved assuming the other, ooh, yes.

22:40.200 --> 22:44.600
You can queue multiple operations on the same buffer, the whole set.

22:44.600 --> 22:51.600
Okay, I'm not sure, can you repeat it? Let's take it offline later.

23:01.320 --> 23:03.320
chemistry couple

23:14.600 --> 23:39.200
So the question is if we might still want to validate the data, or in this case fully check it.

23:39.240 --> 23:43.320
Well, the checksum validation is usually done by the NIC today.

23:43.600 --> 23:48.280
So what you get is what you get, you don't have additional validations on the CPU,

23:48.280 --> 23:50.280
Right

23:50.280 --> 23:52.280
We do have a

23:52.600 --> 23:58.400
peek logic that you can use, because you receive a stream of bytes, that's TCP,

23:58.920 --> 24:03.060
you still need to understand how to divide it logically. So you do have peek, and

24:03.060 --> 24:08.380
peek means, yeah, we copy some of the bytes, and, you know, it depends on you,

24:08.380 --> 24:10.380
you know, how many bytes you want.

24:10.380 --> 24:16.380
So for safety, you know, for data quality, a lot of the

24:17.380 --> 24:23.420
making sure that, you know, no bits were flipped is done at the hardware level, right, both on the receive and the send side.

24:24.820 --> 24:29.820
Don't quote me on the, uh, fine details, but in the end you'll be good.

24:30.820 --> 24:32.820
Okay

24:45.820 --> 24:55.700
So the question is about RDMA. RDMA is nice; the problem with RDMA is twofold. One is that you need

24:55.700 --> 25:00.960
dedicated hardware, and you don't have it usually, right, even for RoCE

25:01.300 --> 25:06.820
variants, you know, Ethernet-based RDMA, most people won't have it, right?

25:07.460 --> 25:13.720
And the other is, it's kind of a competing solution in a way; RDMA doesn't need this, and

25:15.420 --> 25:22.820
the only benefit of RDMA would actually be for an application that actually does need to read the data, because it

25:22.820 --> 25:24.820
writes into your user space

25:25.340 --> 25:27.340
Memory and then you can access it

25:32.780 --> 25:40.060
It's not a common API. I don't, you know, want to implement ibverbs inside of this.

25:40.580 --> 25:44.960
It's distinct, it services a very specific

25:46.140 --> 25:48.140
network proxy

25:48.140 --> 25:55.900
zero copy use case, right. If you're not a proxy, if you're not proxying some stream from one side to another,

25:57.580 --> 25:59.580
You can do you know

25:59.580 --> 26:01.580
Something else

26:03.660 --> 26:05.660
Okay, if we have time.

26:07.660 --> 26:11.180
Okay, if I have two user space processes talking to each other,

26:11.900 --> 26:15.580
I want to ship data without, you know, copies or whatever. Okay.

26:18.380 --> 26:24.860
No, but again, I'm not sure. So you're talking about Unix sockets, and you want to do

26:25.420 --> 26:28.620
zero copy with fuse from user space like

26:30.860 --> 26:32.860
So again fuse

26:32.860 --> 26:36.220
is, you know, right now, a user space

26:36.820 --> 26:39.460
file system, not a network, not a network file system.

26:43.460 --> 26:47.460
Yeah, that's not a network one; a network one works.

26:49.140 --> 26:50.140
All right

26:50.140 --> 26:52.140
We have one more question

26:56.140 --> 27:17.860
Yeah, does user space register the buffers? Yes. So I am using io_uring just as the interface to communicate with the kernel, and then

27:18.140 --> 27:25.300
and, uh, there are multiple ways to go about it; io_uring just seemed the most suitable for it.

27:25.820 --> 27:29.340
Is it the interface for the communication, or for the data transfer?

27:35.180 --> 27:37.180
Thank you folks

