summaryrefslogtreecommitdiff
path: root/net/sched/sch_api.c
Commit message (Collapse)AuthorAgeFilesLines
* net: sched: remove qdisc_watchdog->last_expiresEric Dumazet2023-03-091-2/+4
| | | | | | | | | This field mirrors hrtimer softexpires, we can instead use the existing helpers. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230308182648.1150762-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/sched: fix error recovery in qdisc_create()Eric Dumazet2023-02-131-7/+8
| | | | | | | | | | | | | | | | If TCA_STAB attribute is malformed, qdisc_get_stab() returns an error, and we end up calling ops->destroy() while ops->init() has not been called yet. While we are at it, call qdisc_put_stab() after ops->destroy(). Fixes: 1f62879e3632 ("net/sched: make stab available before ops->init() call") Reported-by: syzbot+d44d88f1d11e6ca8576b@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Cc: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: make stab available before ops->init() callVladimir Oltean2023-02-081-18/+11
| | | | | | | | | | | | | | | | | | | | | | Some qdiscs like taprio turn out to be actually pretty reliant on a well configured stab, to not underestimate the skb transmission time (by properly accounting for L1 overhead). In a future change, taprio will need the stab, if configured by the user, to be available at ops->init() time. It will become even more important in upcoming work, when the overhead will be used for the queueMaxSDU calculation that is passed to an offloading driver. However, rcu_assign_pointer(sch->stab, stab) is called right after ops->init(), making it unavailable, and I don't really see a good reason for that. Move it earlier, which nicely seems to simplify the error handling path as well. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* sched: add new attr TCA_EXT_WARN_MSG to report tc extact messageHangbin Liu2023-01-171-21/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will report extack message if there is an error via netlink_ack(). But if the rule is not to be exclusively executed by the hardware, extack is not passed along and offloading failures don't get logged. In commit 81c7288b170a ("sched: cls: enable verbose logging") Marcelo made cls could log verbose info for offloading failures, which helps improving Open vSwitch debuggability when using flower offloading. It would also be helpful if userspace monitor tools, like "tc monitor", could log this kind of message, as it doesn't require vswitchd log level adjusment. Let's add a new tc attributes to report the extack message so the monitor program could receive the failures. e.g. # tc monitor added chain dev enp3s0f1np1 parent ffff: chain 0 added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1 ct_state +trk+new not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 Warning: mlx5_core: matching on ct_state +new isn't supported. In this patch I only report the extack message on add/del operations. It doesn't look like we need to report the extack message on get/dump operations. Note this message not only reporte to multicast groups, it could also be reported unicast, which may affect the current usersapce tool's behaivor. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20230113034353.2766735-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net: sched: disallow noqueue for qdisc classesFrederick Lawler2023-01-101-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While experimenting with applying noqueue to a classful queue discipline, we discovered a NULL pointer dereference in the __dev_queue_xmit() path that generates a kernel OOPS: # dev=enp0s5 # tc qdisc replace dev $dev root handle 1: htb default 1 # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit # tc qdisc add dev $dev parent 1:1 handle 10: noqueue # ping -I $dev -w 1 -c 1 1.1.1.1 [ 2.172856] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2.173217] #PF: supervisor instruction fetch in kernel mode ... [ 2.178451] Call Trace: [ 2.178577] <TASK> [ 2.178686] htb_enqueue+0x1c8/0x370 [ 2.178880] dev_qdisc_enqueue+0x15/0x90 [ 2.179093] __dev_queue_xmit+0x798/0xd00 [ 2.179305] ? _raw_write_lock_bh+0xe/0x30 [ 2.179522] ? __local_bh_enable_ip+0x32/0x70 [ 2.179759] ? ___neigh_create+0x610/0x840 [ 2.179968] ? eth_header+0x21/0xc0 [ 2.180144] ip_finish_output2+0x15e/0x4f0 [ 2.180348] ? dst_output+0x30/0x30 [ 2.180525] ip_push_pending_frames+0x9d/0xb0 [ 2.180739] raw_sendmsg+0x601/0xcb0 [ 2.180916] ? _raw_spin_trylock+0xe/0x50 [ 2.181112] ? _raw_spin_unlock_irqrestore+0x16/0x30 [ 2.181354] ? get_page_from_freelist+0xcd6/0xdf0 [ 2.181594] ? sock_sendmsg+0x56/0x60 [ 2.181781] sock_sendmsg+0x56/0x60 [ 2.181958] __sys_sendto+0xf7/0x160 [ 2.182139] ? handle_mm_fault+0x6e/0x1d0 [ 2.182366] ? do_user_addr_fault+0x1e1/0x660 [ 2.182627] __x64_sys_sendto+0x1b/0x30 [ 2.182881] do_syscall_64+0x38/0x90 [ 2.183085] entry_SYSCALL_64_after_hwframe+0x63/0xcd ... [ 2.187402] </TASK> Previously in commit d66d6c3152e8 ("net: sched: register noqueue qdisc"), NULL was set for the noqueue discipline on noqueue init so that __dev_queue_xmit() falls through for the noqueue case. This also sets a bypass of the enqueue NULL check in the register_qdisc() function for the struct noqueue_disc_ops. Classful queue disciplines make it past the NULL check in __dev_queue_xmit() because the discipline is set to htb (in this case), and then in the call to __dev_xmit_skb(), it calls into htb_enqueue() which grabs a leaf node for a class and then calls qdisc_enqueue() by passing in a queue discipline which assumes ->enqueue() is not set to NULL. Fix this by not allowing classes to be assigned to the noqueue discipline. Linux TC Notes states that classes cannot be set to the noqueue discipline. [1] Let's enforce that here. Links: 1. https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt Fixes: d66d6c3152e8 ("net: sched: register noqueue qdisc") Cc: stable@vger.kernel.org Signed-off-by: Frederick Lawler <fred@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20230109163906.706000-1-fred@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/sched: add retpoline wrapper for tcPedro Tammela2022-12-091-0/+5
| | | | | | | | | | | | | On kernels using retpoline as a spectrev2 mitigation, optimize actions and filters that are compiled as built-ins into a direct call. On subsequent patches we expose the classifiers and actions functions and wire up the wrapper into tc. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: fix race condition in qdisc_graft()Eric Dumazet2022-10-191-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had one syzbot report [1] in syzbot queue for a while. I was waiting for more occurrences and/or a repro but Dmitry Vyukov spotted the issue right away. <quoting Dmitry> qdisc_graft() drops reference to qdisc in notify_and_destroy while it's still assigned to dev->qdisc </quoting> Indeed, RCU rules are clear when replacing a data structure. The visible pointer (dev->qdisc in this case) must be updated to the new object _before_ RCU grace period is started (qdisc_put(old) in this case). [1] BUG: KASAN: use-after-free in __tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066 Read of size 4 at addr ffff88802065e038 by task syz-executor.4/21027 CPU: 0 PID: 21027 Comm: syz-executor.4 Not tainted 6.0.0-rc3-syzkaller-00363-g7726d4c3e60b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report+0xb1/0x1e0 mm/kasan/report.c:495 __tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066 __tcf_qdisc_find net/sched/cls_api.c:1051 [inline] tc_new_tfilter+0x34f/0x2200 net/sched/cls_api.c:2018 rtnetlink_rcv_msg+0x955/0xca0 net/core/rtnetlink.c:6081 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5efaa89279 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f5efbc31168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f5efab9bf80 RCX: 00007f5efaa89279 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000005 RBP: 00007f5efaae32e9 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5efb0cfb1f R14: 00007f5efbc31300 R15: 0000000000022000 </TASK> Allocated by task 21027: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:437 [inline] ____kasan_kmalloc mm/kasan/common.c:516 [inline] ____kasan_kmalloc mm/kasan/common.c:475 [inline] __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525 kmalloc_node include/linux/slab.h:623 [inline] kzalloc_node include/linux/slab.h:744 [inline] qdisc_alloc+0xb0/0xc50 net/sched/sch_generic.c:938 qdisc_create_dflt+0x71/0x4a0 net/sched/sch_generic.c:997 attach_one_default_qdisc net/sched/sch_generic.c:1152 [inline] netdev_for_each_tx_queue include/linux/netdevice.h:2437 [inline] attach_default_qdiscs net/sched/sch_generic.c:1170 [inline] dev_activate+0x760/0xcd0 net/sched/sch_generic.c:1229 __dev_open+0x393/0x4d0 net/core/dev.c:1441 __dev_change_flags+0x583/0x750 net/core/dev.c:8556 rtnl_configure_link+0xee/0x240 net/core/rtnetlink.c:3189 rtnl_newlink_create net/core/rtnetlink.c:3371 [inline] __rtnl_newlink+0x10b8/0x17e0 net/core/rtnetlink.c:3580 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3593 rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 21020: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:367 [inline] ____kasan_slab_free+0x166/0x1c0 mm/kasan/common.c:329 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1754 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1780 slab_free mm/slub.c:3534 [inline] kfree+0xe2/0x580 mm/slub.c:4562 rcu_do_batch kernel/rcu/tree.c:2245 [inline] rcu_core+0x7b5/0x1890 kernel/rcu/tree.c:2505 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571 Last potentially related work creation: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348 call_rcu+0x99/0x790 kernel/rcu/tree.c:2793 qdisc_put+0xcd/0xe0 net/sched/sch_generic.c:1083 notify_and_destroy net/sched/sch_api.c:1012 [inline] qdisc_graft+0xeb1/0x1270 net/sched/sch_api.c:1084 tc_modify_qdisc+0xbb7/0x1a00 net/sched/sch_api.c:1671 rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Second to last potentially related work creation: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348 kvfree_call_rcu+0x74/0x940 kernel/rcu/tree.c:3322 neigh_destroy+0x431/0x630 net/core/neighbour.c:912 neigh_release include/net/neighbour.h:454 [inline] neigh_cleanup_and_release+0x1f8/0x330 net/core/neighbour.c:103 neigh_del net/core/neighbour.c:225 [inline] neigh_remove_one+0x37d/0x460 net/core/neighbour.c:246 neigh_forced_gc net/core/neighbour.c:276 [inline] neigh_alloc net/core/neighbour.c:447 [inline] ___neigh_create+0x18b5/0x29a0 net/core/neighbour.c:642 ip6_finish_output2+0xfb8/0x1520 net/ipv6/ip6_output.c:125 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] mld_sendpack+0xa09/0xe70 net/ipv6/mcast.c:1820 mld_send_cr net/ipv6/mcast.c:2121 [inline] mld_ifc_work+0x71c/0xdc0 net/ipv6/mcast.c:2653 process_one_work+0x991/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 The buggy address belongs to the object at ffff88802065e000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 56 bytes inside of 1024-byte region [ffff88802065e000, ffff88802065e400) The buggy address belongs to the physical page: page:ffffea0000819600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20658 head:ffffea0000819600 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 0000000000000000 dead000000000001 ffff888011841dc0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3523, tgid 3523 (sshd), ts 41495190986, free_ts 41417713212 prep_new_page mm/page_alloc.c:2532 [inline] get_page_from_freelist+0x109b/0x2ce0 mm/page_alloc.c:4283 __alloc_pages+0x1c7/0x510 mm/page_alloc.c:5515 alloc_pages+0x1a6/0x270 mm/mempolicy.c:2270 alloc_slab_page mm/slub.c:1824 [inline] allocate_slab+0x27e/0x3d0 mm/slub.c:1969 new_slab mm/slub.c:2029 [inline] ___slab_alloc+0x7f1/0xe10 mm/slub.c:3031 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3118 slab_alloc_node mm/slub.c:3209 [inline] __kmalloc_node_track_caller+0x2f2/0x380 mm/slub.c:4955 kmalloc_reserve net/core/skbuff.c:358 [inline] __alloc_skb+0xd9/0x2f0 net/core/skbuff.c:430 alloc_skb_fclone include/linux/skbuff.h:1307 [inline] tcp_stream_alloc_skb+0x38/0x580 net/ipv4/tcp.c:861 tcp_sendmsg_locked+0xc36/0x2f80 net/ipv4/tcp.c:1325 tcp_sendmsg+0x2b/0x40 net/ipv4/tcp.c:1483 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 sock_write_iter+0x291/0x3d0 net/socket.c:1108 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x9e9/0xdd0 fs/read_write.c:578 ksys_write+0x1e8/0x250 fs/read_write.c:631 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1449 [inline] free_pcp_prepare+0x5e4/0xd20 mm/page_alloc.c:1499 free_unref_page_prepare mm/page_alloc.c:3380 [inline] free_unref_page+0x19/0x4d0 mm/page_alloc.c:3476 __unfreeze_partials+0x17c/0x1a0 mm/slub.c:2548 qlink_free mm/kasan/quarantine.c:168 [inline] qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:187 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:294 __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:447 kasan_slab_alloc include/linux/kasan.h:224 [inline] slab_post_alloc_hook mm/slab.h:727 [inline] slab_alloc_node mm/slub.c:3243 [inline] slab_alloc mm/slub.c:3251 [inline] __kmem_cache_alloc_lru mm/slub.c:3258 [inline] kmem_cache_alloc+0x267/0x3b0 mm/slub.c:3268 kmem_cache_zalloc include/linux/slab.h:723 [inline] alloc_buffer_head+0x20/0x140 fs/buffer.c:2974 alloc_page_buffers+0x280/0x790 fs/buffer.c:829 create_empty_buffers+0x2c/0xee0 fs/buffer.c:1558 ext4_block_write_begin+0x1004/0x1530 fs/ext4/inode.c:1074 ext4_da_write_begin+0x422/0xae0 fs/ext4/inode.c:2996 generic_perform_write+0x246/0x560 mm/filemap.c:3738 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:270 ext4_file_write_iter+0x44a/0x1660 fs/ext4/file.c:679 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x9e9/0xdd0 fs/read_write.c:578 Fixes: af356afa010f ("net_sched: reintroduce dev->qdisc for use by sch_api") Reported-by: syzbot <syzkaller@googlegroups.com> Diagnosed-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221018203258.2793282-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: ensure n arg not empty before call bind_classZhengchao Shao2022-10-021-1/+1
| | | | | | | | All bind_class callbacks are directly returned when n arg is empty. Therefore, bind_class is invoked only when n arg is not empty. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: query offload capabilities through ndo_setup_tc()Vladimir Oltean2022-09-291-0/+17
| | | | | | | | | | | | | | | | | | | | | | When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: remove duplicate check of user rights in qdiscZhengchao Shao2022-08-231-11/+0
| | | | | | | | | | In rtnetlink_rcv_msg function, the permission for all user operations is checked except the GET operation, which is the same as the checking in qdisc. Therefore, remove the user rights check in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Message-Id: <20220819041854.83372-1-shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net_sched: move from strlcpy with unused retval to strscpyWolfram Sang2022-08-221-1/+1
| | | | | | | | | | | Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20220818210228.8635-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: delete unused input parameter in qdisc_createZhengchao Shao2022-08-161-3/+3
| | | | | | | | The input parameter p is unused in qdisc_create. Delete it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815061023.51318-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: remove the unused return value of unregister_qdiscZhengchao Shao2022-08-161-2/+3
| | | | | | | | Return value of unregister_qdisc is unused, remove it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815030417.271894-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: rename reference+tracking helpersJakub Kicinski2022-06-091-1/+1
| | | | | | | | | | | | | | | Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net_sched: add __rcu annotation to netdev->qdiscEric Dumazet2022-02-141-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot found a data-race [1] which lead me to add __rcu annotations to netdev->qdisc, and proper accessors to get LOCKDEP support. [1] BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1: attach_default_qdiscs net/sched/sch_generic.c:1167 [inline] dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221 __dev_open+0x2e9/0x3a0 net/core/dev.c:1416 __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139 rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150 __rtnl_newlink net/core/rtnetlink.c:3489 [inline] rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529 rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0: qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323 __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050 tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211 rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 470502de5bdb ("net: sched: unlock rules update API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: Clarify error message when qdisc kind is unknownVictor Nogueira2022-01-201-1/+1
| | | | | | | | | | | | | | When adding a tc rule with a qdisc kind that is not supported or not compiled into the kernel, the kernel emits the following error: "Error: Specified qdisc not found.". Found via tdc testing when ETS qdisc was not compiled in and it was not obvious right away what the message meant without looking at the kernel code. Change the error message to be more explicit and say the qdisc kind is unknown. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sch_api: Don't skip qdisc attach on ingressMaxim Mikityanskiy2022-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The attach callback of struct Qdisc_ops is used by only a few qdiscs: mq, mqprio and htb. qdisc_graft() contains the following logic (pseudocode): if (!qdisc->ops->attach) { if (ingress) do ingress stuff; else do egress stuff; } if (!ingress) { ... if (qdisc->ops->attach) qdisc->ops->attach(qdisc); } else { ... } As we see, the attach callback is not called if the qdisc is being attached to ingress (TC_H_INGRESS). That wasn't a problem for mq and mqprio, since they contain a check that they are attached to TC_H_ROOT, and they can't be attached to TC_H_INGRESS anyway. However, the commit cited below added the attach callback to htb. It is needed for the hardware offload, but in the non-offload mode it simulates the "do egress stuff" part of the pseudocode above. The problem is that when htb is attached to ingress, neither "do ingress stuff" nor attach() is called. It results in an inconsistency, and the following message is printed to dmesg: unregister_netdevice: waiting for lo to become free. Usage count = 2 This commit addresses the issue by running "do ingress stuff" in the ingress flow even in the attach callback is present, which is fine, because attach isn't going to be called afterwards. The bug was found by syzbot and reported by Eric. Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reported-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: add missing tracker information in qdisc_create()Eric Dumazet2022-01-051-1/+1
| | | | | | | | | | qdisc_create() error path needs to use dev_put_track() because qdisc_alloc() allocated the tracker. Fixes: 606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220104170439.3790052-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: Remove Qdisc::running sequence counterAhmed S. Darwish2021-10-181-13/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Qdisc::running sequence counter has two uses: 1. Reliably reading qdisc's tc statistics while the qdisc is running (a seqcount read/retry loop at gnet_stats_add_basic()). 2. As a flag, indicating whether the qdisc in question is running (without any retry loops). For the first usage, the Qdisc::running sequence counter write section, qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what is actually needed: the raw qdisc's bstats update. A u64_stats sync point was thus introduced (in previous commits) inside the bstats structure itself. A local u64_stats write section is then started and stopped for the bstats updates. Use that u64_stats sync point mechanism for the bstats read/retry loop at gnet_stats_add_basic(). For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag, accessed with atomic bitops, is sufficient. Using a bit flag instead of a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads to the SMP barriers implicitly added through raw_read_seqcount() and write_seqcount_begin/end() getting removed. All call sites have been surveyed though, and no required ordering was identified. Now that the qdisc->running sequence counter is no longer used, remove it. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the qdisc tc statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data typesAhmed S. Darwish2021-10-181-1/+1
| | | | | | | | | | | | | | | | | | | The only factor differentiating per-CPU bstats data type (struct gnet_stats_basic_cpu) from the packed non-per-CPU one (struct gnet_stats_basic_packed) was a u64_stats sync point inside the former. The two data types are now equivalent: earlier commits added a u64_stats sync point to the latter. Combine both data types into "struct gnet_stats_basic_sync". This eliminates redundancy and simplifies the bstats read/write APIs. Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit architectures, u64_stats sync points do not use sequence counter protection. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-09-301-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | drivers/net/phy/bcm7xxx.c d88fd1b546ff ("net: phy: bcm7xxx: Fixed indirect MMD operations") f68d08c437f9 ("net: phy: bcm7xxx: Add EPHY entry for 72165") net/sched/sch_api.c b193e15ac69d ("net: prevent user from passing illegal stab size") 69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers") Both cases trivial - adjacent code additions. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: prevent user from passing illegal stab size王贇2021-09-261-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We observed below report when playing with netlink sock: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:580:10 shift exponent 249 is too large for 32-bit type CPU: 0 PID: 685 Comm: a.out Not tainted Call Trace: dump_stack_lvl+0x8d/0xcf ubsan_epilogue+0xa/0x4e __ubsan_handle_shift_out_of_bounds+0x161/0x182 __qdisc_calculate_pkt_len+0xf0/0x190 __dev_queue_xmit+0x2ed/0x15b0 it seems like kernel won't check the stab log value passing from user, and will use the insane value later to calculate pkt_len. This patch just add a check on the size/cell_log to avoid insane calculation. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: Use struct_size() and flex_array_size() helpersGustavo A. R. Silva2021-09-291-3/+4
|/ | | | | | | | | | | | Make use of the struct_size() and flex_array_size() helpers instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worse scenario, could lead to heap overflows. Link: https://github.com/KSPP/linux/issues/160 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20210928193107.GA262595@embeddedor Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/sched: Remove unnecessary if statementYajun Deng2021-07-161-8/+2
| | | | | | | | | | It has been deal with the 'if (err' statement in rtnetlink_send() and rtnl_unicast(). so remove unnecessary if statement. v2: use the raw name rtnetlink_send(). Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: avoid duplicates in classes dumpMaximilian Heyne2021-03-041-4/+4
| | | | | | | | | | | | | This is a follow up of commit ea3274695353 ("net: sched: avoid duplicates in qdisc dump") which has fixed the issue only for the qdisc dump. The duplicate printing also occurs when dumping the classes via tc class show dev eth0 Fixes: 59cc1f61f09c ("net: sched: convert qdisc linked list to hashtable") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: Add extack to Qdisc_class_ops.deleteMaxim Mikityanskiy2021-01-221-3/+4
| | | | | | | | | | | In a following commit, sch_htb will start using extack in the delete class operation to pass hardware errors in offload mode. This commit prepares for that by adding the extack parameter to this callback and converting usage of the existing qdiscs. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net_sched: reject silly cell_log in qdisc_get_rtab()Eric Dumazet2021-01-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iproute2 probably never goes beyond 8 for the cell exponent, but stick to the max shift exponent for signed 32bit. UBSAN reported: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22 shift exponent 130 is too large for 32-bit type 'int' CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x183/0x22e lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg net/socket.c:672 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345 ___sys_sendmsg net/socket.c:2399 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2432 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: remove redundant 'rtnl_held' argumentVlad Buslov2020-12-011-2/+2
| | | | | | | | | | | | | Functions tfilter_notify_chain() and tcf_get_next_proto() are always called with rtnl lock held in current implementation. Moreover, attempting to call them without rtnl lock would cause a warning down the call chain in function __tcf_get_next_proto() that requires the lock to be held by callers. Remove the 'rtnl_held' argument in order to simplify the code and make rtnl lock requirement explicit. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20201127151205.23492-1-vladbu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* treewide: rename nla_strlcpy to nla_strscpy.Francis Laniel2020-11-161-1/+1
| | | | | | | | | Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new name of this function. Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Modify return value of nla_strlcpy to match that of strscpy.Francis Laniel2020-11-161-1/+1
| | | | | | | | | | | | | | | | nla_strlcpy now returns -E2BIG if src was truncated when written to dst. It also returns this error value if dstsize is 0 or higher than INT_MAX. For example, if src is "foo\0" and dst is 3 bytes long, the result will be: 1. "foG" after memcpy (G means garbage). 2. "fo\0" after memset. 3. -E2BIG is returned because src was not completely written into dst. The callers of nla_strlcpy were modified to take into account this modification. Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* sched: sch_api: add missing rcu read lock to silence the warningJiri Pirko2020-07-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the qdisc_match_from_root function() is called from non-rcu path with rtnl mutex held, a suspiciout rcu usage warning appears: [ 241.504354] ============================= [ 241.504358] WARNING: suspicious RCU usage [ 241.504366] 5.8.0-rc4-custom-01521-g72a7c7d549c3 #32 Not tainted [ 241.504370] ----------------------------- [ 241.504378] net/sched/sch_api.c:270 RCU-list traversed in non-reader section!! [ 241.504382] other info that might help us debug this: [ 241.504388] rcu_scheduler_active = 2, debug_locks = 1 [ 241.504394] 1 lock held by tc/1391: [ 241.504398] #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0 [ 241.504431] stack backtrace: [ 241.504440] CPU: 0 PID: 1391 Comm: tc Not tainted 5.8.0-rc4-custom-01521-g72a7c7d549c3 #32 [ 241.504446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [ 241.504453] Call Trace: [ 241.504465] dump_stack+0x100/0x184 [ 241.504482] lockdep_rcu_suspicious+0x153/0x15d [ 241.504499] qdisc_match_from_root+0x293/0x350 Fix this by passing the rtnl held lockdep condition down to hlist_for_each_entry_rcu() Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Remove redundant condition in qdisc_graftGaurav Singh2020-06-201-2/+1
| | | | | | | | | parent cannot be NULL here since its in the else part of the if (parent == NULL) condition. Remove the extra check on parent pointer. Signed-off-by: Gaurav Singh <gaurav1086@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: add a tracepoint for qdisc creationCong Wang2020-05-271-0/+3
| | | | | | | | | | | | | | | | | | | | With this tracepoint, we could know when qdisc's are created, especially those default qdisc's. Sample output: tc-736 [001] ...1 56.230107: qdisc_create: dev=ens3 kind=pfifo parent=1:0 tc-736 [001] ...1 56.230113: qdisc_create: dev=ens3 kind=hfsc parent=ffff:ffff tc-738 [001] ...1 56.256816: qdisc_create: dev=ens3 kind=pfifo parent=1:100 tc-739 [001] ...1 56.267584: qdisc_create: dev=ens3 kind=pfifo parent=1:200 tc-740 [001] ...1 56.279649: qdisc_create: dev=ens3 kind=fq_codel parent=1:100 tc-741 [001] ...1 56.289996: qdisc_create: dev=ens3 kind=pfifo_fast parent=1:200 tc-745 [000] .N.1 111.687483: qdisc_create: dev=ens3 kind=ingress parent=ffff:fff1 Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: do not reprogram a timer about to expireEric Dumazet2020-03-171-2/+7
| | | | | | | | | | | | qdisc_watchdog_schedule_range_ns() can use the newly added slack and avoid rearming the hrtimer a bit earlier than the current value. This patch has no effect if delta_ns parameter is zero. Note that this means the max slack is potentially doubled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: add qdisc_watchdog_schedule_range_ns()Eric Dumazet2020-03-171-5/+7
| | | | | | | | | | | | | | | | Some packet schedulers might want to add a slack when programming hrtimers. This can reduce number of interrupts and increase batch sizes and thus give good xmit_more savings. This commit adds qdisc_watchdog_schedule_range_ns() helper, with an extra delta_ns parameter. Legacy qdisc_watchdog_schedule_n() becomes an inline passing a zero slack. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: walk through all child classes in tc_bind_tclass()Cong Wang2020-01-271-11/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a complex TC class hierarchy like this: tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit \ avpkt 1000 cell 8 tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit \ rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20 \ avpkt 1000 bounded tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 80 0xffff flowid 1:3 tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \ sport 25 0xffff flowid 1:4 tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit \ rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit \ rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20 \ avpkt 1000 where filters are installed on qdisc 1:0, so we can't merely search from class 1:1 when creating class 1:3 and class 1:4. We have to walk through all the child classes of the direct parent qdisc. Otherwise we would miss filters those need reverse binding. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: fix ops->bind_class() implementationsCong Wang2020-01-271-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementations of ops->bind_class() are merely searching for classid and updating class in the struct tcf_result, without invoking either of cl_ops->bind_tcf() or cl_ops->unbind_tcf(). This breaks the design of them as qdisc's like cbq use them to count filters too. This is why syzbot triggered the warning in cbq_destroy_class(). In order to fix this, we have to call cl_ops->bind_tcf() and cl_ops->unbind_tcf() like the filter binding path. This patch does so by refactoring out two helper functions __tcf_bind_filter() and __tcf_unbind_filter(), which are lockless and accept a Qdisc pointer, then teaching each implementation to call them correctly. Note, we merely pass the Qdisc pointer as an opaque pointer to each filter, they only need to pass it down to the helper functions without understanding it at all. Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class") Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: fix backward compatibility for TCA_KINDCong Wang2019-10-081-2/+1
| | | | | | | | | | | | | | | | | Marcelo noticed a backward compatibility issue of TCA_KIND after we move from NLA_STRING to NLA_NUL_STRING, so it is probably too late to change it. Instead, to make everyone happy, we can just insert a NUL to terminate the string with nla_strlcpy() like we do for TC actions. Fixes: 62794fc4fbf5 ("net_sched: add max len check for TCA_KIND") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* net_sched: add max len check for TCA_KINDCong Wang2019-09-211-1/+2
| | | | | | | | | | | | | | | | The TCA_KIND attribute is of NLA_STRING which does not check the NUL char. KMSAN reported an uninit-value of TCA_KIND which is likely caused by the lack of NUL. Change it to NLA_NUL_STRING and add a max len too. Fixes: 8b4c3cdd9dd8 ("net: sched: Add policy validation for tc attributes") Reported-and-tested-by: syzbot+618aacd49e8c8b8486bd@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* net_sched: check cops->tcf_block in tc_bind_tclass()Cong Wang2019-09-101-0/+2
| | | | | | | | | | | | | | | At least sch_red and sch_tbf don't implement ->tcf_block() while still have a non-zero tc "class". Instead of adding nop implementations to each of such qdisc's, we can just relax the check of cops->tcf_block() in tc_bind_tclass(). They don't support TC filter anyway. Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner2019-05-301-5/+1
| | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netlink: make validation more configurable for future strictnessJohannes Berg2019-04-271-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek2019-04-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, tooPaolo Abeni2019-04-101-1/+14
| | | | | | | | | | | | | | | | | | | | | Since stats updating is always consistent with TCQ_F_CPUSTATS flag, we can disable it at qdisc creation time flipping such bit. In my experiments, if the NOLOCK flag is cleared, per CPU stats accounting does not give any measurable performance gain, but it waste some memory. Let's clear TCQ_F_CPUSTATS together with NOLOCK, when enslaving a NOLOCK qdisc to 'lock' one. Use stats update helper inside pfifo_fast, to cope correctly with TCQ_F_CPUSTATS flag change. As a side effect, q.qlen value for any child qdiscs is always consistent for all lock classfull qdiscs. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: return correct value for *notify* functionsZhike Wang2019-03-131-4/+11
| | | | | | | | | | | | | | | | | | | | It is confusing to directly use return value of netlink_send()/ netlink_unicast() as the return value of *notify*, as it may be not error at all. Example: in tc_del_tfilter(), after calling tfilter_del_notify(), it will goto errout if (err). However, the netlink_send()/netlink_unicast() will return positive value even for successful case. So it may not call tcf_chain_tp_remove() and so on to clean up the resource, as a result, resource is leaked. It may be easier to only check the return value of tfilter_del_nofiy(), but it is more clean to correct all related functions. Co-developed-by: Zengmo Gao <gaozengmo@jd.com> Signed-off-by: Zhike Wang <wangzhike@jd.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: using kfree_rcu() to simplify the codeWei Yongjun2019-02-181-6/+1
| | | | | | | | The callback function of call_rcu() just calls a kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: sch_api: set an error msg when qdisc_alloc_handle() failsIvan Vecera2019-02-171-2/+4
| | | | | | | | | | | This patch sets an error message in extack when the number of qdisc handles exceeds the maximum. Also the error-code ENOSPC is more appropriate than ENOMEM in this situation. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reported-by: Li Shuang <shuali@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: extend proto ops to support unlocked classifiersVlad Buslov2019-02-121-3/+3
| | | | | | | | | | | | | | | Add 'rtnl_held' flag to tcf proto change, delete, destroy, dump, walk functions to track rtnl lock status. Extend users of these function in cls API to propagate rtnl lock status to them. This allows classifiers to obtain rtnl lock when necessary and to pass rtnl lock status to extensions and driver offload callbacks. Add flags field to tcf proto ops. Add flag value to indicate that classifier doesn't require rtnl lock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: traverse classifiers in chain with tcf_get_next_proto()Vlad Buslov2019-02-121-2/+2
| | | | | | | | | | | | | All users of chain->filters_chain rely on rtnl lock and assume that no new classifier instances are added when traversing the list. Use tcf_get_next_proto() to traverse filters list without relying on rtnl mutex. This function iterates over classifiers by taking reference to current iterator classifier only and doesn't assume external synchronization of filters list. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: traverse chains in block with tcf_get_next_chain()Vlad Buslov2019-02-121-1/+3
| | | | | | | | | | | | | | | | All users of block->chain_list rely on rtnl lock and assume that no new chains are added when traversing the list. Use tcf_get_next_chain() to traverse chain list without relying on rtnl mutex. This function iterates over chains by taking reference to current iterator chain only and doesn't assume external synchronization of chain list. Don't take reference to all chains in block when flushing and use tcf_get_next_chain() to safely iterate over chain list instead. Remove tcf_block_put_all_chains() that is no longer used. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>