zero_copy(9)
NAME
zero_copy, zero_copy_sockets - zero copy sockets code
SYNOPSIS
options ZERO_COPY_SOCKETS
DESCRIPTION
The FreeBSD kernel includes a facility for eliminating data copies on
socket reads and writes.

This code is collectively known as the zero copy sockets code, because
during normal network I/O, data will not be copied by the CPU at all.
Rather it will be DMAed from the user's buffer to the NIC (for sends),
or DMAed from the NIC to a buffer that will then be given to the user
(receives).

The zero copy sockets code uses the standard socket read and write
semantics, and therefore has some limitations and restrictions that
programmers should be aware of when trying to take advantage of this
functionality.
For sending data, there are no special requirements or capabilities
that the sending NIC must have. The data written to the socket, though,
must be at least a page in size and page aligned in order to be mapped
into the kernel. If it does not meet the page size and alignment
constraints, it will be copied into the kernel, as is normally the case
with socket I/O.

The user should be careful not to overwrite buffers that have been
written to the socket before the data has been freed by the kernel, and
the copy-on-write mapping cleared. If a buffer is overwritten before it
has been given up by the kernel, the data will be copied, and no
savings in CPU utilization and memory bandwidth utilization will be
realized.
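
The size and alignment constraints for sends can be met from user space
roughly as follows. This is a minimal sketch, not code from the zero
copy implementation: the helper names round_up_to_page(),
alloc_page_buffer() and is_page_aligned() are hypothetical, and it
assumes a POSIX system providing posix_memalign(3) and sysconf(3).
Whether the kernel then maps the buffer rather than copying it still
depends on the kernel having been built with ZERO_COPY_SOCKETS.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round len up to a whole number of pages (page must be a power of two). */
static size_t
round_up_to_page(size_t len, size_t page)
{
	assert(page != 0 && (page & (page - 1)) == 0);
	return ((len + page - 1) & ~(page - 1));
}

/* Return nonzero if p starts on a page boundary. */
static int
is_page_aligned(const void *p, size_t page)
{
	return (((uintptr_t)p & (page - 1)) == 0);
}

/*
 * Allocate a buffer that is page aligned and a whole number of pages
 * long, large enough to hold len bytes.  The usable length is stored
 * in *out_len.  Returns NULL on failure; release with free(3).
 */
static void *
alloc_page_buffer(size_t len, size_t *out_len)
{
	long ps = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	size_t size;

	if (ps <= 0 || len == 0)
		return (NULL);
	size = round_up_to_page(len, (size_t)ps);
	if (posix_memalign(&buf, (size_t)ps, size) != 0)
		return (NULL);
	*out_len = size;
	return (buf);
}
```

A buffer obtained this way can be passed to write(2) or send(2) in
whole-page units. It should not be modified or freed while the kernel
may still hold it.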
The socket(2) API does not really give the user any indication of when
their data has actually been sent over the wire, or when the data has
been freed from kernel buffers. For protocols like TCP, the data will
be kept around in the kernel until it has been acknowledged by the
other side; it must be kept until the acknowledgement is received in
case retransmission is required.

From an application standpoint, the best way to guarantee that the data
has been sent out over the wire and freed by the kernel (for TCP-based
sockets) is to set a socket buffer size (see the SO_SNDBUF socket
option in the setsockopt(2) manual page) appropriate for the
application and network environment and then make sure you have sent
out twice as much data as the socket buffer size before reusing a
buffer. For TCP, the send and receive socket buffer sizes generally
directly correspond to the TCP window size.

For receiving data, in order to take advantage of the zero copy receive
code, the user must have a NIC that is configured for an MTU greater
than the architecture page size. (E.g., for alpha this would be 8KB,
for i386, it would be 4KB.) Additionally, in order for zero copy
receive to work, packet payloads must be at least a page in size and
page aligned.
Achieving page aligned payloads requires a NIC that can split an
incoming packet into multiple buffers. It also generally requires some
sort of intelligence on the NIC to make sure that the payload starts in
its own buffer. This is called ``header splitting''. Currently the only
NICs with support for header splitting are Alteon Tigon 2 based boards
running slightly modified firmware. The FreeBSD ti(4) driver includes
modified firmware for Tigon 2 boards only. Header splitting code can be
written, however, for any NIC that allows putting received packets into
multiple buffers and that has enough programmability to determine that
the header should go into one buffer and the payload into another.

You can also do a form of header splitting that does not require any
NIC modifications if your NIC is at least capable of splitting packets
into multiple buffers. This requires that you optimize the NIC driver
for your most common packet header size. If that size (ethernet + IP +
TCP headers) is generally 66 bytes, for instance, you would set the
first buffer in a set for a particular packet to be 66 bytes long, and
then subsequent buffers would be a page in size. For packets that have
headers that are exactly 66 bytes long, your payload will be page
aligned.
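
The fixed-size header buffer scheme can be illustrated with a small
model of how one packet is spread across receive buffers. This is a
sketch under stated assumptions, not driver code: plan_rx_buffers() and
payload_page_aligned() are hypothetical names, and the data buffers
after the first are assumed to be page aligned themselves.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill sizes[] with the number of bytes of one packet that land in
 * each receive buffer: the first buffer holds up to hdr_buf_len bytes,
 * every later buffer holds up to one page.  Returns the number of
 * buffers used, or 0 if the packet is empty or does not fit in
 * max_bufs buffers.
 */
static size_t
plan_rx_buffers(size_t pkt_len, size_t hdr_buf_len, size_t page,
    size_t *sizes, size_t max_bufs)
{
	size_t n = 0, left = pkt_len, chunk;

	assert(hdr_buf_len > 0 && page > 0);
	if (left == 0 || max_bufs == 0)
		return (0);
	chunk = left < hdr_buf_len ? left : hdr_buf_len;
	sizes[n++] = chunk;
	left -= chunk;
	while (left > 0) {
		if (n == max_bufs)
			return (0);
		chunk = left < page ? left : page;
		sizes[n++] = chunk;
		left -= chunk;
	}
	return (n);
}

/*
 * The payload begins at the start of the second buffer only when the
 * packet's real header length equals the length of the first buffer.
 */
static int
payload_page_aligned(size_t actual_hdr_len, size_t hdr_buf_len)
{
	return (actual_hdr_len == hdr_buf_len);
}
```

With a 66 byte first buffer and 4KB pages, a packet carrying 8192 bytes
of payload behind a 66 byte header occupies buffers of 66, 4096 and
4096 bytes. A packet whose header is longer or shorter than 66 bytes
spills into or falls short of the second buffer, so its payload is not
page aligned and will be copied.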
The other requirement for zero copy receive to work is that the buffer
that is the destination for the data read from a socket must be at
least a page in size and page aligned.

Obviously the requirements for receive side zero copy are impossible to
meet without NIC hardware that is programmable enough to do header
splitting of some sort. Since most NICs are not that programmable, or
their manufacturers will not share the source code to their firmware,
this approach to zero copy receive is not widely useful.

There are other approaches, such as RDMA and TCP Offload, that may
potentially help alleviate the CPU overhead associated with copying
data out of the kernel. Most known techniques require some sort of
support at the NIC level to work, and describing such techniques is
beyond the scope of this manual page.

The zero copy send and zero copy receive code can be individually
turned off via the kern.ipc.zero_copy.send and
kern.ipc.zero_copy.receive sysctl variables respectively.
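
The two sysctl variables can be inspected from a program with
sysctlbyname(3). The sketch below is hedged in two ways:
read_zero_copy_flag() is a hypothetical helper, and it assumes the
variables are plain integers where a nonzero value means enabled. On
systems other than FreeBSD, or on kernels without these variables, the
call simply fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Read an integer sysctl such as "kern.ipc.zero_copy.send" into
 * *value.  Returns 0 on success and -1 on failure with errno set.
 * Assumption: the variable is of type int.
 */
static int
read_zero_copy_flag(const char *name, int *value)
{
	assert(name != NULL && value != NULL);
#if defined(__FreeBSD__)
	size_t len = sizeof(*value);

	return (sysctlbyname(name, value, &len, NULL, 0));
#else
	/* sysctlbyname(3) is not assumed to exist elsewhere. */
	errno = ENOSYS;
	return (-1);
#endif
}
```

Changing a value takes the same call with the new value passed in the
last two arguments, and requires appropriate privileges. The sysctl(8)
utility offers the same access from the command line.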
SEE ALSO
HISTORY
The zero copy sockets code first appeared in FreeBSD 5.0, although it
has been in existence in patch form since at least mid-1999.
AUTHORS
The zero copy sockets code was originally written by Andrew Gallatin
<gallatin@FreeBSD.org> and substantially modified and updated by
Kenneth Merry <ken@FreeBSD.org>.

BSD December 5, 2004