Commit 5bb053be authored by Linus Torvalds's avatar Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:

 1) Support offloading wireless authentication to userspace via
    NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.

 2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
    Setup and cleanup of namespaces now all run asynchronously and thus
    performance is significantly increased.

 3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
    Streiff.

 4) Support zerocopy on RDS sockets, from Sowmini Varadhan.

 5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
    Borkmann.

 6) Support hw offload of vlan filtering in mvpp2 dreiver, from Maxime
    Chevallier.

 7) Support grafting of child qdiscs in mlxsw driver, from Nogah
    Frankel.

 8) Add packet forwarding tests to selftests, from Ido Schimmel.

 9) Deal with sub-optimal GSO packets better in BBR congestion control,
    from Eric Dumazet.

10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.

11) Add path MTU tests to selftests, from Stefano Brivio.

12) Various bits of IPSEC offloading support for mlx5, from Aviad
    Yehezkel, Yossi Kuperman, and Saeed Mahameed.

13) Support RSS spreading on ntuple filters in SFC driver, from Edward
    Cree.

14) Lots of sockmap work from John Fastabend. Applications can use eBPF
    to filter sendmsg and sendpage operations.

15) In-kernel receive TLS support, from Dave Watson.

16) Add XDP support to ixgbevf, this is significant because it should
    allow optimized XDP usage in various cloud environments. From Tony
    Nguyen.

17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
    Venkataramanan et al.

18) IP fragmentation match offload support in nfp driver, from Pieter
    Jansen van Vuuren.

19) Support XDP redirect in i40e driver, from Björn Töpel.

20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
    tracepoints in their raw form, from Alexei Starovoitov.

21) Lots of striding RQ improvements to mlx5 driver with many
    performance improvements, from Tariq Toukan.

22) Use rhashtable for inet frag reassembly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
  net: mvneta: improve suspend/resume
  net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
  ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
  net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
  net: bgmac: Correctly annotate register space
  route: check sysctl_fib_multipath_use_neigh earlier than hash
  fix typo in command value in drivers/net/phy/mdio-bitbang.
  sky2: Increase D3 delay to sky2 stops working after suspend
  net/mlx5e: Set EQE based as default TX interrupt moderation mode
  ibmvnic: Disable irqs before exiting reset from closed state
  net: sched: do not emit messages while holding spinlock
  vlan: also check phy_driver ts_info for vlan's real device
  Bluetooth: Mark expected switch fall-throughs
  Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
  Bluetooth: btrsi: remove unused including <linux/version.h>
  Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
  sh_eth: kill useless check in __sh_eth_get_regs()
  sh_eth: add sh_eth_cpu_data::no_xdfar flag
  ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
  ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
  ...
parents bb2407a7 159f0297

Too many changes to show.

To preserve performance only 1000 of 1000+ files are displayed.
......@@ -539,6 +539,18 @@ A: Although LLVM IR generation and optimization try to stay architecture
The clang option "-fno-jump-tables" can be used to disable
switch table generation.
- For clang -target bpf, it is guaranteed that pointer or long /
unsigned long types will always have a width of 64 bit, no matter
whether underlying clang binary or default target (or kernel) is
32 bit. However, when native clang target is used, then it will
compile these types based on the underlying architecture's conventions,
meaning in case of 32 bit architecture, pointer or long / unsigned
long types e.g. in BPF context structure will have width of 32 bit
while the BPF LLVM back end still operates in 64 bit. The native
target is mostly needed in tracing for the case of walking pt_regs
or other kernel structures where CPU's register width matters.
Otherwise, clang -target bpf is generally recommended.
You should use default target when:
- Your program includes a header file, e.g., ptrace.h, which eventually
......
......@@ -13,9 +13,18 @@ placed as a child node of an mdio device.
The properties described here are those specific to Marvell devices.
Additional required and optional properties can be found in dsa.txt.
The compatibility string is used only to find an identification register,
which is at a different MDIO base address in different switch families.
- "marvell,mv88e6085" : Switch has base address 0x10. Use with models:
6085, 6095, 6097, 6123, 6131, 6141, 6161, 6165,
6171, 6172, 6175, 6176, 6185, 6240, 6320, 6321,
6341, 6350, 6351, 6352
- "marvell,mv88e6190" : Switch has base address 0x00. Use with models:
6190, 6190X, 6191, 6290, 6390, 6390X
Required properties:
- compatible : Should be one of "marvell,mv88e6085" or
"marvell,mv88e6190"
"marvell,mv88e6190" as indicated above
- reg : Address on the MII bus for the switch.
Optional properties:
......
......@@ -10,6 +10,8 @@ Documentation/devicetree/bindings/phy/phy-bindings.txt.
the boot program; should be used in cases where the MAC address assigned to
the device by the boot program is different from the "local-mac-address"
property;
- nvmem-cells: phandle, reference to an nvmem node for the MAC address;
- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used;
- max-speed: number, specifies maximum speed in Mbit/s supported by the device;
- max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than
the maximum frame size (there's contradiction in the Devicetree
......
* MCR20A IEEE 802.15.4 *
Required properties:
- compatible: should be "nxp,mcr20a"
- spi-max-frequency: maximal bus speed, should be set to a frequency
lower than 9000000 depends sync or async operation mode
- reg: the chipselect index
- interrupts: the interrupt generated by the device. Non high-level
can occur deadlocks while handling isr.
Optional properties:
- rst_b-gpio: GPIO spec for the RST_B pin
Example:
mcr20a@0 {
compatible = "nxp,mcr20a";
spi-max-frequency = <9000000>;
reg = <0>;
interrupts = <17 2>;
interrupt-parent = <&gpio>;
rst_b-gpio = <&gpio 27 1>
};
......@@ -29,6 +29,7 @@ Optional properties for PHY child node:
- reset-gpios : Should specify the gpio for phy reset
- magic-packet : If present, indicates that the hardware supports waking
up via magic packet.
- phy-handle : see ethernet.txt file in the same directory
Examples:
......
......@@ -9,6 +9,7 @@ Required properties on all platforms:
- compatible: Depending on the platform this should be one of:
- "amlogic,meson6-dwmac"
- "amlogic,meson8b-dwmac"
- "amlogic,meson8m2-dwmac"
- "amlogic,meson-gxbb-dwmac"
Additionally "snps,dwmac" and any applicable more
detailed version number described in net/stmmac.txt
......@@ -19,13 +20,13 @@ Required properties on all platforms:
configuration (for example the PRG_ETHERNET register range
on Meson8b and newer)
Required properties on Meson8b and newer:
Required properties on Meson8b, Meson8m2, GXBB and newer:
- clock-names: Should contain the following:
- "stmmaceth" - see stmmac.txt
- "clkin0" - first parent clock of the internal mux
- "clkin1" - second parent clock of the internal mux
Optional properties on Meson8b and newer:
Optional properties on Meson8b, Meson8m2, GXBB and newer:
- amlogic,tx-delay-ns: The internal RGMII TX clock delay (provided
by this driver) in nanoseconds. Allowed values
are: 0ns, 2ns, 4ns, 6ns.
......
* NI XGE Ethernet controller
Required properties:
- compatible: Should be "ni,xge-enet-2.00"
- reg: Address and length of the register set for the device
- interrupts: Should contain tx and rx interrupt
- interrupt-names: Should be "rx" and "tx"
- phy-mode: See ethernet.txt file in the same directory.
- phy-handle: See ethernet.txt file in the same directory.
- nvmem-cells: Phandle of nvmem cell containing the MAC address
- nvmem-cell-names: Should be "address"
Examples (10G generic PHY):
nixge0: ethernet@40000000 {
compatible = "ni,xge-enet-2.00";
reg = <0x40000000 0x6000>;
nvmem-cells = <&eth1_addr>;
nvmem-cell-names = "address";
interrupts = <0 29 IRQ_TYPE_LEVEL_HIGH>, <0 30 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "rx", "tx";
interrupt-parent = <&intc>;
phy-mode = "xgmii";
phy-handle = <&ethernet_phy1>;
ethernet_phy1: ethernet-phy@4 {
compatible = "ethernet-phy-ieee802.3-c45";
reg = <4>;
};
};
......@@ -7,6 +7,7 @@ Required properties:
- compatible: Must contain one or more of the following:
- "renesas,etheravb-r8a7743" for the R8A7743 SoC.
- "renesas,etheravb-r8a7745" for the R8A7745 SoC.
- "renesas,etheravb-r8a77470" for the R8A77470 SoC.
- "renesas,etheravb-r8a7790" for the R8A7790 SoC.
- "renesas,etheravb-r8a7791" for the R8A7791 SoC.
- "renesas,etheravb-r8a7792" for the R8A7792 SoC.
......
......@@ -33,6 +33,10 @@ Optional Properties:
Select (AKA RS1) output gpio signal (SFP+ only), low: low Tx rate, high:
high Tx rate. Must not be present for SFF modules
- maximum-power-milliwatt : Maximum module power consumption
Specifies the maximum power consumption allowable by a module in the
slot, in milli-Watts. Presently, modules can be up to 1W, 1.5W or 2W.
Example #1: Direct serdes to SFP connection
sfp_eth3: sfp-eth3 {
......@@ -40,6 +44,7 @@ sfp_eth3: sfp-eth3 {
i2c-bus = <&sfp_1g_i2c>;
los-gpios = <&cpm_gpio2 22 GPIO_ACTIVE_HIGH>;
mod-def0-gpios = <&cpm_gpio2 21 GPIO_ACTIVE_LOW>;
maximum-power-milliwatt = <1000>;
pinctrl-names = "default";
pinctrl-0 = <&cpm_sfp_1g_pins &cps_sfp_1g_pins>;
tx-disable-gpios = <&cps_gpio1 24 GPIO_ACTIVE_HIGH>;
......
......@@ -9,6 +9,7 @@ Required properties:
- "socionext,uniphier-pxs2-ave4" : for PXs2 SoC
- "socionext,uniphier-ld11-ave4" : for LD11 SoC
- "socionext,uniphier-ld20-ave4" : for LD20 SoC
- "socionext,uniphier-pxs3-ave4" : for PXs3 SoC
- reg: Address where registers are mapped and size of region.
- interrupts: Should contain the MAC interrupt.
- phy-mode: See ethernet.txt in the same directory. Allow to choose
......
......@@ -25,6 +25,8 @@ Optional property:
software needs to take when this pin is
strapped in these modes. See data manual
for details.
- ti,clk-output-sel - Muxing option for CLK_OUT pin - see dt-bindings/net/ti-dp83867.h
for applicable values.
Note: ti,min-output-impedance and ti,max-output-impedance are mutually
exclusive. When both properties are present ti,max-output-impedance
......
Intel(R) Ethernet Connection E800 Series Linux Driver
===================================================================
Intel ice Linux driver.
Copyright(c) 2018 Intel Corporation.
Contents
========
- Enabling the driver
- Support
The driver in this release supports Intel's E800 Series of products. For
more information, visit Intel's support page at http://support.intel.com.
Enabling the driver
===================
The driver is enabled via the standard kernel configuration system,
using the make command:
Make oldconfig/silentoldconfig/menuconfig/etc.
The driver is located in the menu structure at:
-> Device Drivers
-> Network device support (NETDEVICES [=y])
-> Ethernet driver support
-> Intel devices
-> Intel(R) Ethernet Connection E800 Series Support
Support
=======
For general information, go to the Intel support website at:
http://support.intel.com
If an issue is identified with the released source code, please email
the maintainer listed in the MAINTAINERS file.
......@@ -133,14 +133,11 @@ min_adv_mss - INTEGER
IP Fragmentation:
ipfrag_high_thresh - INTEGER
Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
is reached. This also serves as a maximum limit to namespaces
different from the initial one.
ipfrag_low_thresh - INTEGER
ipfrag_high_thresh - LONG INTEGER
Maximum memory used to reassemble IP fragments.
ipfrag_low_thresh - LONG INTEGER
(Obsolete since linux-4.17)
Maximum memory used to reassemble IP fragments before the kernel
begins to remove incomplete fragment queues to free up resources.
The kernel still accepts new fragments for defragmentation.
......@@ -755,13 +752,13 @@ udp_rmem_min - INTEGER
Minimal size of receive buffer used by UDP sockets in moderation.
Each UDP socket is able to use the size for receiving data, even if
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
Default: 1 page
Default: 4K
udp_wmem_min - INTEGER
Minimal size of send buffer used by UDP sockets in moderation.
Each UDP socket is able to use the size for sending data, even if
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
Default: 1 page
Default: 4K
CIPSOv4 Variables:
......@@ -1363,6 +1360,13 @@ flowlabel_reflect - BOOLEAN
FALSE: disabled
Default: FALSE
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes.
Default: 0 (Layer 3)
Possible values:
0 - Layer 3 (source and destination addresses plus flow label)
1 - Layer 4 (standard 5-tuple)
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
......@@ -1696,7 +1700,9 @@ disable_ipv6 - BOOLEAN
interface and start Duplicate Address Detection, if necessary.
When this value is changed from 0 to 1 (IPv6 is being disabled),
it will dynamically delete all address on the given interface.
it will dynamically delete all addresses and routes on the given
interface. From now on it will not possible to add addresses/routes
to the selected interface.
accept_dad - INTEGER
Whether to accept DAD (Duplicate Address Detection).
......@@ -2094,7 +2100,7 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
It is guaranteed to each SCTP socket (but not association) even
under moderate memory pressure.
Default: 1 page
Default: 4K
sctp_wmem - vector of 3 INTEGERs: min, default, max
Currently this tunable has no effect.
......
......@@ -72,11 +72,6 @@ this flag, a process must first signal intent by setting a socket option:
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
error(1, errno, "setsockopt zerocopy");
Setting the socket option only works when the socket is in its initial
(TCP_CLOSED) state. Trying to set the option for a socket returned by accept(),
for example, will lead to an EBUSY error. In this case, the option should be set
to the listening socket and it will be inherited by the accepted sockets.
Transmission
------------
......
Net DIM - Generic Network Dynamic Interrupt Moderation
======================================================
Author:
Tal Gilboa <talgi@mellanox.com>
Contents
=========
- Assumptions
- Introduction
- The Net DIM Algorithm
- Registering a Network Device to DIM
- Example
Part 0: Assumptions
======================
This document assumes the reader has basic knowledge in network drivers
and in general interrupt moderation.
Part I: Introduction
======================
Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
interrupt moderation configuration of a channel in order to optimize packet
processing. The mechanism includes an algorithm which decides if and how to
change moderation parameters for a channel, usually by performing an analysis on
runtime data sampled from the system. Net DIM is such a mechanism. In each
iteration of the algorithm, it analyses a given sample of the data, compares it
to the previous sample and if required, it can decide to change some of the
interrupt moderation configuration fields. The data sample is composed of data
bandwidth, the number of packets and the number of events. The time between
samples is also measured. Net DIM compares the current and the previous data and
returns an adjusted interrupt moderation configuration object. In some cases,
the algorithm might decide not to change anything. The configuration fields are
the minimum duration (microseconds) allowed between events and the maximum
number of wanted packets per event. The Net DIM algorithm ascribes importance to
increase bandwidth over reducing interrupt rate.
Part II: The Net DIM Algorithm
===============================
Each iteration of the Net DIM algorithm follows these steps:
1. Calculates new data sample.
2. Compares it to previous sample.
3. Makes a decision - suggests interrupt moderation configuration fields.
4. Applies a schedule work function, which applies suggested configuration.
The first two steps are straightforward, both the new and the previous data are
supplied by the driver registered to Net DIM. The previous data is the new data
supplied to the previous iteration. The comparison step checks the difference
between the new and previous data and decides on the result of the last step.
A step would result as "better" if bandwidth increases and as "worse" if
bandwidth reduces. If there is no change in bandwidth, the packet rate is
compared in a similar fashion - increase == "better" and decrease == "worse".
In case there is no change in the packet rate as well, the interrupt rate is
compared. Here the algorithm tries to optimize for lower interrupt rate so an
increase in the interrupt rate is considered "worse" and a decrease is
considered "better". Step #2 has an optimization for avoiding false results: it
only considers a difference between samples as valid if it is greater than a
certain percentage. Also, since Net DIM does not measure anything by itself, it
assumes the data provided by the driver is valid.
Step #3 decides on the suggested configuration based on the result from step #2
and the internal state of the algorithm. The states reflect the "direction" of
the algorithm: is it going left (reducing moderation), right (increasing
moderation) or standing still. Another optimization is that if a decision
to stay still is made multiple times, the interval between iterations of the
algorithm would increase in order to reduce calculation overhead. Also, after
"parking" on one of the most left or most right decisions, the algorithm may
decide to verify this decision by taking a step in the other direction. This is
done in order to avoid getting stuck in a "deep sleep" scenario. Once a
decision is made, an interrupt moderation configuration is selected from
the predefined profiles.
The last step is to notify the registered driver that it should apply the
suggested configuration. This is done by scheduling a work function, defined by
the Net DIM API and provided by the registered driver.
As you can see, Net DIM itself does not actively interact with the system. It
would have trouble making the correct decisions if the wrong data is supplied to
it and it would be useless if the work function would not apply the suggested
configuration. This does, however, allow the registered driver some room for
manoeuvre as it may provide partial data or ignore the algorithm suggestion
under some conditions.
Part III: Registering a Network Device to DIM
==============================================
Net DIM API exposes the main function net_dim(struct net_dim *dim,
struct net_dim_sample end_sample). This function is the entry point to the Net
DIM algorithm and has to be called every time the driver would like to check if
it should change interrupt moderation parameters. The driver should provide two
data structures: struct net_dim and struct net_dim_sample. Struct net_dim
describes the state of DIM for a specific object (RX queue, TX queue,
other queues, etc.). This includes the current selected profile, previous data
samples, the callback function provided by the driver and more.
Struct net_dim_sample describes a data sample, which will be compared to the
data sample stored in struct net_dim in order to decide on the algorithm's next
step. The sample should include bytes, packets and interrupts, measured by
the driver.
In order to use Net DIM from a networking driver, the driver needs to call the
main net_dim() function. The recommended method is to call net_dim() on each
interrupt. Since Net DIM has a built-in moderation and it might decide to skip
iterations under certain conditions, there is no need to moderate the net_dim()
calls as well. As mentioned above, the driver needs to provide an object of type
struct net_dim to the net_dim() function call. It is advised for each entity
using Net DIM to hold a struct net_dim as part of its data structure and use it
as the main Net DIM API object. The struct net_dim_sample should hold the latest
bytes, packets and interrupts count. No need to perform any calculations, just
include the raw data.
The net_dim() call itself does not return anything. Instead Net DIM relies on
the driver to provide a callback function, which is called when the algorithm
decides to make a change in the interrupt moderation parameters. This callback
will be scheduled and run in a separate thread in order not to add overhead to
the data flow. After the work is done, Net DIM algorithm needs to be set to
the proper state in order to move to the next iteration.
Part IV: Example
=================
The following code demonstrates how to register a driver to Net DIM. The actual
usage is not complete but it should make the outline of the usage clear.
my_driver.c:
#include <linux/net_dim.h>
/* Callback for net DIM to schedule on a decision to change moderation */
void my_driver_do_dim_work(struct work_struct *work)
{
/* Get struct net_dim from struct work_struct */
struct net_dim *dim = container_of(work, struct net_dim,
work);
/* Do interrupt moderation related stuff */
...
/* Signal net DIM work is done and it should move to next iteration */
dim->state = NET_DIM_START_MEASURE;
}
/* My driver's interrupt handler */
int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
{
...
/* A struct to hold current measured data */
struct net_dim_sample dim_sample;
...
/* Initiate data sample struct with current data */
net_dim_sample(my_entity->events,
my_entity->packets,
my_entity->bytes,
&dim_sample);
/* Call net DIM */
net_dim(&my_entity->dim, dim_sample);
...
}
/* My entity's initialization function (my_entity was already allocated) */
int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
{
...
/* Initiate struct work_struct with my driver's callback function */
INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
...
}
Netfilter's flowtable infrastructure
====================================
This documentation describes the software flowtable infrastructure available in
Netfilter since Linux kernel 4.16.
Overview
--------
Initial packets follow the classic forwarding path, once the flow enters the
established state according to the conntrack semantics (ie. we have seen traffic
in both directions), then you can decide to offload the flow to the flowtable
from the forward chain via the 'flow offload' action available in nftables.
Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
path (the visible effect is that you do not see these packets from any of the
netfilter hooks coming after the ingress). In case of flowtable miss, the packet
follows the classic forward path.
The flowtable uses a resizable hashtable, lookups are based on the following
7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
and destination ports and the input interface (useful in case there are several
conntrack zones in place).
Flowtables are populated via the 'flow offload' nftables action, so the user can
selectively specify what flows are placed into the flow table. Hence, packets
follow the classic forwarding path unless the user explicitly instruct packets
to use this new alternative forwarding path via nftables policy.
This is represented in Fig.1, which describes the classic forwarding path
including the Netfilter hooks and the flowtable fastpath bypass.
userspace process
^ |
| |
_____|____ ____\/___
/ \ / \
| input | | output |
\__________/ \_________/
^ |
| |
_________ __________ --------- _____\/_____
/ \ / \ |Routing | / \
--> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
\_________/ \__________/ ---------- \____________/ ^
| ^ | | ^ |
flowtable | | ____\/___ | |
| | | / \ | |
__\/___ | --------->| forward |------------ |
|-----| | \_________/ |
|-----| | 'flow offload' rule |
|-----| | adds entry to |
|_____| | flowtable |
| | |
/ \ | |
/hit\_no_| |
\ ? / |
\ / |
|__yes_________________fastpath bypass ____________________________|
Fig.1 Netfilter hooks and flowtable interactions
The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that matches the initial packets that went
through the classic forwarding path. The TTL is decremented before calling
neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
path given that the transport selectors are missing, therefore flowtable lookup
is not possible.
Example configuration
---------------------
Enabling the flowtable bypass is relatively easy, you only need to create a
flowtable and add one rule to your forward chain.
table inet x {
flowtable f {
hook ingress priority 0 devices = { eth0, eth1 };
}
chain y {
type filter hook forward priority 0; policy accept;
ip protocol tcp flow offload @f
counter packets 0 bytes 0
}
}
This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline, this is convenient in case you already have a
nftables ingress chain (make sure the flowtable priority is smaller than the
nftables ingress chain hence the flowtable runs before in the pipeline).
The 'flow offload' action from the forward chain 'y' adds an entry to the
flowtable for the TCP syn-ack packet coming in the reply direction. Once the
flow is offloaded, you will observe that the counter rule in the example above
does not get updated for the packets that are being forwarded through the
forwarding bypass.
More reading
------------
This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also
made a very complete and comprehensive summary called "A state of network
acceleration" that describes how things were before this infrastructure was
mailined [3] and it also makes a rough summary of this work [4].
[1] https://lwn.net/Articles/738214/
[2] https://lwn.net/Articles/742164/
[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
......@@ -7,15 +7,12 @@ socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
i) capture network traffic with utilities like tcpdump, ii) transmit network