summaryrefslogtreecommitdiff
path: root/drivers/accel/habanalabs
Commit message (Collapse)AuthorAgeFilesLines
* accel/habanalabs: mask part of hmmu page fault captured addressDani Liberman2023-05-161-3/+11
| | | | | | | | | | | | When receiving page fault from hmmu, the captured address is scrambled both by HW and by driver. The driver part is unscrambled but the HW part isn't getting unscrambled. To avoid declaring wrong address, the HW scrambled part will be masked. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: update state when loading boot fitKoby Elbaz2023-05-161-15/+10
| | | | | | | | | | Any FW component we load must be followed by a corresponding state update. However, it seems that so far we skipped doing so for the bootfit case, so fix that. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: print qman data on error only for lower qmanTomer Tayar2023-05-163-128/+31
| | | | | | | | | | | | | | By default, the upper QMANs are not used, and instead engines ARCs access the lower QMANs directly. Errors for upper QMANs are therefore not expected, and the debug print of the PQ entries is not needed. Modify the QMAN debug data print on errors to include only information for the lower QMAN. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: use lower QM in QM errors handlingTomer Tayar2023-05-161-5/+5
| | | | | | | | | | | The QMAN GLBL_ERR_STS_4 register has indications for errors also in the lower CQ and the ARC CQ, and not just for errors in the lower CP. Modify the relevant define/struct and the related print to use "lower QM" instead of "lower CP". Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: use binning info when handling razwiDani Liberman2023-05-161-3/+14
| | | | | | | | | | | When receiving sei interrupt from tpc or decoder, we need to check the binning mask because if the engine is binned, the razwi info won't be in the router of the binned engine, instead will be in the router of the substitute engine. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: remove support for mmu disableOfir Bitton2023-05-1611-232/+26
| | | | | | | | | As mmu disable mode is only used for bring-up stages, let's remove this option and all code related to it. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: upon DMA errors, use FW-extracted error causeKoby Elbaz2023-05-161-29/+8
| | | | | | | | | | | Initially, the driver used to read the error cause data directly from the ASIC. However, the FW now clears it before the driver could read it. Therefore we should use the error cause data that is extracted by the FW. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: print max timeout value on CS stuckOded Gabbay2023-05-161-11/+15
| | | | | | | | | If a workload got stuck, we print an error to the kernel log about it. Add to that print the configured max timeout value, as that value is not fixed between ASICs and in addition it can be configured using a kernel module parameter. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: align to latest firmware specsOded Gabbay2023-05-162-43/+16
| | | | | | Update the firmware common interface files with the latest version. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: fix mem leak in capture user mappingsMoti Haimovski2023-05-161-0/+2
| | | | | | | | | | | This commit fixes a memory leak caused when clearing the user_mappings info when a new context is opened immediately after user_mapping is captured and a hard reset is performed. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: set unused bit as reservedOded Gabbay2023-05-161-1/+1
| | | | | | | Get latest f/w gaudi2 interface file which marks unused bist_need_iatu_config bit in cold_rst_data structure as reserved bit. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: rename security functions related argumentsKoby Elbaz2023-05-161-28/+29
| | | | | | | | | Make the argument names specify the registers array represent registers that should be unsecured so the user can access them. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: fix gaudi2_get_tpc_idle_status() returnDan Carpenter2023-05-161-1/+1
| | | | | | | | | | The gaudi2_get_tpc_idle_status() function returned the incorrect variable so it always returned true. Fixes: d85f0531b928 ("accel/habanalabs: break is_idle function into per-engine sub-routines") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: Fix some kernel-doc commentsYang Li2023-05-161-2/+2
| | | | | | | | | | | | | | | | | Make the description of @regs_range_array and @regs_range_array_size to @user_regs_range_array and @user_regs_range_array_size to silence the warnings: drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array' not described in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Function parameter or member 'user_regs_range_array_size' not described in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array' description in 'hl_init_pb_ranges_single_dcore' drivers/accel/habanalabs/common/security.c:506: warning: Excess function parameter 'regs_range_array_size' description in 'hl_init_pb_ranges_single_dcore' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=4940 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: always fetch pci addr_dec error infoOfir Bitton2023-05-151-8/+6
| | | | | | | | | Due to missing indication of address decode source (LBW/HBW bus), we should always try and fetch extended information. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: fix a static warning - 'dubious: x & !y'Koby Elbaz2023-05-151-3/+3
| | | | | | | | Use a straight forward approach to get a conditional result. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: poll for device status update following WFE cmdKoby Elbaz2023-05-151-5/+23
| | | | | | | | | | | | | | Currently, we rely on COMMS protocol's ack to verify that WFE command has been acknowledged by the FW. However, this does not guarantee that the device status has been updated. Although unlikely, this could trigger a race since the driver expects the device to be halted at that stage, but it might not be. Therefore, we increase WFE's robustness by polling on the status register that will be updated once the device is actually halted. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: expose debugfs files laterTomer Tayar2023-05-153-44/+61
| | | | | | | | | | | | | | | Currently the debugfs root folder and files for a device are created at an early step, before the device initialization and before the char device and sysfs files are exposed to user. As there is no real reason not to do it together with the device creation, postpone it to be done right afterwards. The initialization of the debugfs entry structure is left in its current position because it is used before creating the files. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: add pci health check during heartbeatOfir Bitton2023-05-153-3/+16
| | | | | | | | | | | | Currently upon a heartbeat failure, we don't know if the failure is due to firmware hang or due to a bad PCI link. Hence, we are reading a PCI config space register with a known value (vendor ID) so we will know which of the two possibilities caused the heartbeat failure. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: add missing tpc interrupt infoDafna Hirschfeld2023-05-151-1/+2
| | | | | | | | | For some reason the last possible tpc interrupt cause in gaudi2_tpc_interrupts_cause is missing from the code. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: refactor abort of completions and waitsKoby Elbaz2023-05-153-8/+14
| | | | | | | | | | | Aborting CS completions should be in command_submission.c but aborting waiting for user interrupts should be in device.c. This separation is also for adding more abort operations in the future. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: minimize encapsulation signal mutex lock timeKoby Elbaz2023-05-151-2/+8
| | | | | | | | | | | | | Sync Stream Encapsulated Signal Handlers can be managed from different contexts, and as such they are protected via a spin_lock. However, spin_lock was unnecessarily protecting a larger code section than really needed, covering a sleepable code section as well. Since spin_lock disables preemption, it could lead to sleeping in atomic context. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: add unregister timestamp uapifarah kassabri2023-05-151-23/+100
| | | | | | | | | | | Add uapi to allow user to unregister timestamp record. This is needed when the user wishes to re-use the same record with different interrupt id. For that, the user must first unregister it from the current interrupt id and then register it with the new id. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: call to HW/FW err returns 0 when no events existMoti Haimovski2023-05-151-10/+10
| | | | | | | | | | This commit modifies the call to retrieve HW or FW error events to return success when no events are pending, as done in the calls to other events. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: unsecure TPC bias registersOfir Bitton2023-05-151-0/+1
| | | | | | | | | User needs to be able to perform downcast / upcast of fp8_143 dtype. Hence bias register needs to be accessed by the user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: do soft-reset using cpucp packetDafna Hirschfeld2023-05-154-10/+35
| | | | | | | | | This is done depending on the FW version. The cpucp method is preferable and saves scratchpads resource. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: check fw version using sw versionDafna Hirschfeld2023-05-152-3/+9
| | | | | | | | | The fw inner version is less trustable, instead use the fw general sw release version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: extract and save the FW's SW major/minor/sub-minorDafna Hirschfeld2023-05-152-6/+78
| | | | | | | | | | | It is not always possible to know the FW's SW version from the inner FW version. Therefore we should extract the general SW version in addition to the FW version and use it in functions like 'hl_is_fw_ver_below_1_9' etc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: rename fw_{major/minor}_version to fw_inner_{major/minor}_verDafna Hirschfeld2023-05-152-9/+9
| | | | | | | | | | We later want to add fields for Firmware SW version. The current extracted FW version is the inner FW versioning so the new name is better and also better differentiate from the FW's SW version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: add helper to extract the FW major/minorDafna Hirschfeld2023-05-151-24/+45
| | | | | | | | | the helper is extract_u32_until_given_char and can later be used to also get the major/minor of the sw version. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: fix bug in free scratchpad memoryMoti Haimovski2023-05-151-2/+2
| | | | | | | | | This commit fixes a bug in Gaudi2 when freeing the scratchpad memory in case software init fails. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: remove commented code that won't be usedKoby Elbaz2023-05-151-9/+0
| | | | | | | | | Once it was decided that these security settings are to be done by FW rather than by the driver, there's no reason to keep them in the code. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: allow user to modify EDMA RL registerRakesh Ughreja2023-05-151-0/+1
| | | | | | | | | | | EDMA transpose workload requires to signal for every activation. User FW sends all the dummy signals to RD_LBW_RATE_LIM_CFG, to save lbw bandwidth. We need the user to be able to access that register to configure it. Signed-off-by: Rakesh Ughreja <rughreja@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: ignore false positive razwiTal Cohen2023-05-151-16/+27
| | | | | | | | | | | | | | In Gaudi2 asic, PSOC RAZWI may cause in HBW or LBW. The address that caused the error is read from HW register and printed by the Driver. There are cases where the Driver receives an indication on PSOC RAZWI error but the address value is zero. In that case, the indication is a false positive. The Driver should not "count" a PSOC RAZWI event error when the caused the address is zeroed. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* accel/habanalabs: remove variable gaudi_irq_nameTom Rix2023-05-151-7/+0
| | | | | | | | | | | | | | gcc with W=1 reports drivers/accel/habanalabs/gaudi/gaudi.c:117:19: error: ‘gaudi_irq_name’ defined but not used [-Werror=unused-const-variable=] 117 | static const char gaudi_irq_name[GAUDI_MSI_ENTRIES][GAUDI_MAX_STRING_LEN] = { | ^~~~~~~~~~~~~~ This variable is not used so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* Merge tag 'driver-core-6.4-rc1' of ↵Linus Torvalds2023-04-271-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...
| * driver core: class: remove module * from class_create()Greg Kroah-Hartman2023-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'hwmon-for-v6.4' of ↵Linus Torvalds2023-04-251-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon updates from Guenter Roeck: "New drivers - Driver for Acbel FSB032 power supply - Driver for StarFive JH71x0 temperature sensor Added support to existing drivers: - aquacomputer_d5next: Support for Aquacomputer Aquastream XT - nct6775: Added various ASUS boards to list of boards supporting WMI - asus-ec-sensors: ROG STRIX Z390-F GAMING, ProArt B550-Creator, Notable improvements: - Regulator event and sysfs notification support for PMBus drivers Notable cleanup: - Constified pointers to hwmon_channel_info .. and various other minor bug fixes and improvements" * tag 'hwmon-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (131 commits) hwmon: lochnagar: Remove the unneeded include <linux/i2c.h> hwmon: (pmbus/fsp-3y) Fix functionality bitmask in FSP-3Y YM-2151E hwmon: (adt7475) Use device_property APIs when configuring polarity hwmon: (aquacomputer_d5next) Add support for Aquacomputer Aquastream XT hwmon: (it87) Disable/enable SMBus access for IT8622E chipset hwmon: (it87) Add calls to smbus_enable/smbus_disable as required hwmon: (it87) Test for error in it87_update_device hwmon: (it87) Disable SMBus access for environmental controller registers. docs: hwmon: Add documentaion for acbel-fsg032 PSU hwmon: (pmbus/acbel-fsg032) Add Acbel power supply dt-bindings: trivial-devices: Add acbel,fsg032 dt-bindings: vendor-prefixes: Add prefix for acbel hwmon: (sfctemp) Simplify error message hwmon: (pmbus/ibm-cffps) Use default debugfs attributes and lock function hwmon: (pmbus/core) Add lock and unlock functions hwmon: (pmbus/core) Request threaded interrupt with IRQF_ONESHOT hwmon: (nct6775) update ASUS WMI monitoring list A620/B760/W790 hwmon: ina2xx: add optional regulator support dt-bindings: hwmon: ina2xx: add supply property dt-bindings: hwmon: pwm-fan: Convert to DT schema ...
| * | hwmon: constify pointers to hwmon_channel_infoKrzysztof Kozlowski2023-04-071-1/+1
| |/ | | | | | | | | | | | | | | | | HWmon core receives an array of pointers to hwmon_channel_info and it does not modify it, thus it can be array of const pointers for safety. This allows drivers to make them also const. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
* | accel/habanalabs: add missing error flow in hl_sysfs_init()Tomer Tayar2023-04-081-1/+5
| | | | | | | | | | | | | | | | | | | | hl_sysfs_fini() is called only if hl_sysfs_init() completes successfully. Therefore if hl_sysfs_init() fails, need to remove any sysfs group that was added until that point. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: speedup h/w queues test in Gaudi2Moti Haimovski2023-04-082-41/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | HW queues testing at driver load and after reset takes a substantial amount of time. This commit reduces the queues test time in Gaudi2 devices by running all the tests in parallel instead of one after the other. Time measurements on tests duration shows that the new method is almost x100 faster than the serial approach. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: fix handling of arc farm sei eventDani Liberman2023-04-082-11/+18
| | | | | | | | | | | | | | | | | | | | There is only single eq entry for arc farm sei event which aggregates events from the four arc farms. Fix the code to handle this event according to this behavior. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: remove Gaudi1 multi MSI codeOfir Bitton2023-04-082-93/+5
| | | | | | | | | | | | | | | | | | | | Multi MSI interrupts aren't working in Gaudi1 and because of that, we are only using a single MSI interrupt. Therefore, let's remove this dead code in order to avoid confusion. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: fixes for unexpected error interruptOfir Bitton2023-04-082-5/+2
| | | | | | | | | | | | | | | | | | Removing redundant asic prop variable as we don't need to expose this to common code. In addition, fix some typos. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: don't wait for STS_OK after sending COMMS WFEKoby Elbaz2023-04-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Sending COMMS_GOTO_WFE instructs the FW's CPU to halt (WFE state). Once sent, FW's CPU isn't expected to continue communicating with LKD. Therefore, the stage of waiting for COMMS_STS_OK should be skipped or else waiting for COMMS_STS_OK will simply timeout, which will trigger unexpected behavior. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: sync f/w events interrupt in hard resetTal Cohen2023-04-085-25/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Receiving events from FW, while the device is in hard reset, causes a warning message in Driver log. The message may point to a problem in the Driver or FW. But It also can appear as a result of events that have been sent from FW just before the hard reset. In order to avoid receiving events from FW while the device is in reset and is already in 'disabled' mode, sync the f/w events interrupt right before setting the device to 'disabled'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: fix wrong reset and event flagsOfir Bitton2023-04-081-4/+9
| | | | | | | | | | | | | | | | | | During event handling, driver sets relevant reset and user event notifier flags. Fix few wrong flags settings. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: fix events mask of decoder abnormal interruptsTomer Tayar2023-04-081-7/+11
| | | | | | | | | | | | | | | | | | | | The decoder IRQ status register may have several set bits upon an abnormal interrupt. Therefore, when setting the events mask, need to check all bits and not using if-else. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: remove completion from abnormal interrupt work nameTomer Tayar2023-04-083-20/+14
| | | | | | | | | | | | | | | | | | Decoder abnormal interrupts are for errors and not for completion, so rename the relevant work and work function to not include 'completion'. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* | accel/habanalabs: print raw binning masks in debug levelOfir Bitton2023-04-081-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | There are rare cases of failures when cards are initialized due to wrong values in efuse mappings that are parsed by firmware. To help debug those cases, print (in debug level) the raw binning masks as fetched from the firmware during device initialization. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>