Split the properties used for MMU mappings to DRAM and PCI (host) types.
This is a prerequisite for future ASICs support.
Note that in Goya ASIC, the PMMU and DMMU are the same (except of page
sizes) as only one MMU mechanism is used for both of the mapping types.
Hence this patch should not have any effect on current behavior.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Add the ability to invalidate the necessary MMU cache only.
This ability is a prerequisite for future ASICs support.
Note that in Goya ASIC, a single cache is used for both host/DRAM
mappings and hence this patch should not have any effect on current
behavior.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Some of the functions in the memory module code were too long and/or
contained multiple operations that are not always done together. Re-factor
the code by dividing those functions to smaller functions which are more
readable and maintainable.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The two defines that control the maximum size of a command buffer and the
maximum number of JOBS per CS need to be exported to the user as they are
part of the API towards user-space.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
In training, there is a need for a large amount of patching to the recipe.
This results in many command buffers contains a lot of DMA packets. The
number of command buffers per CS is larger than the current maximum of 64,
which is an arbitrary number that is enough for inference, but it has no
real affect on the code and/or resources of the host machine.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Add a new opcode to the INFO IOCTL to allow the user application to
retrieve the ASIC's current and maximum clock rate. The rate is
returned in MHz.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
This patch adds a support for a new H/W queue type.
This type of queue is for DMA and compute engines jobs, for which
completion notification are sent by H/W.
Command buffer for this queue can be created either through the CB
IOCTL and using the retrieved CB handle, or by preparing a buffer on the
host or device SRAM/DRAM, and using the device address to that buffer.
The patch includes the handling of the 2 options, as well as the
initialization of the H/W queue and its jobs scheduling.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Jobs on some queues must be provided with a handle to a driver command
buffer object, while for other queues, jobs must be provided with an
address to a command buffer.
Currently the distinction is done based on the queue type, which is less
flexible if the same queue type behaves differently on different
types of ASICs.
This patch adds a new queue property for this target, which is
configured per queue type per ASIC type.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
When using the macro le32_to_cpu(x), we need to correctly convert x to be
__le32 in case it is defined as u32 variable.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
We want to stop using the acronym KMD. Therefore, replace all locations
(except for register names we can't modify) where KMD is written to other
terms such as "Linux kernel driver" or "Host kernel driver", etc.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Add a new opcode to INFO IOCTL to retrieve aggregate H/W events. i.e. the
events counters are NOT cleared upon device reset, but count from the
loading of the driver.
Add the code to support it in the device event handling function.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Users and sysadmins usually want to know what is the device utilization as
a level 0 indication if they are efficiently using the device.
Add a new opcode to the INFO IOCTL that will return the device utilization
over the last period of 100-1000ms. The return value is 0-100,
representing as percentage the total utilization rate.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
The char devices are currently exposed to user before the device and
driver initialization are done.
This patch moves the cdev and device adding to the system to the end of
the initialization sequence, while keeping the creation of the
structures at the beginning to allow the usage of dev_*().
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch changes the driver to create two char devices for each ASIC
it discovers. This is done to allow system/monitoring applications to
query the device for stats, information, idle state and more, while also
allowing the deep-learning application to send work to the ASIC.
One char device is the original device, hlX. IOCTL calls through this
device file can perform any task on the device (compute, memory, queries).
The open function for this device will fail if it was called before but
the file-descriptor it created was not completely released yet (the
release callback function is not called from the kernel until all
instances of that FD are closed). The driver needs to keep this behavior
to support backward compatibility with existing userspace, which count
that the open will fail if the device is "occupied".
The second char device is called "hl_controlDx", where x is the same index
of the main device with a minor number of the original char device + 1.
Applications that open this device can only call the INFO IOCTL. There is
no limitation on the number of applications opening this device.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds a new list to the driver's device structure. The list will
keep the file private data structures that the driver creates when a user
process opens the device.
This change is needed because it is useless to try to count how many FD
are open. Instead, track our own private data structure per open file and
once it is released, remove it from the list. As long as the list is not
empty, it means we have a user that can do something with our device.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch renames the "user_ctx" field in the device structure to
"compute_ctx". This better reflects the meaning of this context.
In addition, we also check in the ctx_fini() that the debug mode should be
disabled only if the context being destroyed is the compute context. This
has no effect right now as we only have a single process and a single
context, but this makes the code more ready for multiple process support.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds a field to the context's structure that will hold a unique
handle for the context.
This will be needed when the user will create the context.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the driver timeout functions, we give the simulator a factor of 10
in the timeout. This was necessary when the requested timeout is small
but if it was a few seconds, this can result in a very large timeout which
is unnecessary.
This patch caps the maximum timeout of the simulator to 10 seconds, which
is our largest timeout in the code. That is more then enough for anything
the simulator is doing.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
The PQs of internal H/W queues (QMANs) can be located in different memory
areas for different ASICs. Therefore, when writing PQEs, we need to use
the correct function according to the location of the PQ. e.g. if the PQ
is located in the device's memory (SRAM or DRAM), we need to use
memcpy_toio() so it would work in architectures that have separate
address ranges for IO memory.
This patch makes the code that writes the PQE to be ASIC-specific so we
can handle this properly per ASIC.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Tested-by: Ben Segal <bpsegal20@gmail.com>
This patch fix a bug in the host memory polling macro. The bug is that the
memory being polled can be written by the device, which always writes it
in LE. However, if the host is running Linux in BE mode, we need to
convert the value that was written by the device before matching it to the
required value that the caller has given to the macro.
Signed-off-by: Ben Segal <bpsegal20@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The information which is currently provided as a response to the
"HL_INFO_HW_IDLE" IOCTL is merely a general boolean value.
This patch extends it and provides also a bitmask that indicates which
of the device engines are busy.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Command submissions sent to the device are composed of command buffers
which are targeted to different device engines, like DMA and compute
entities. When a command submission gets stuck, knowing in which engine
the stuck is, is crucial for debugging.
This patch adds a debugfs node that exports this information, by
displaying the engines' various registers that assemble their idle/busy
status.
The information retrieval is based on the is_device_idle ASIC function.
The printout in this function, of the first detected busy engine, is
removed because it becomes redundant in the presence of the more
elaborated info of the new debugfs node.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch adds the necessary MMU mappings for the Goya CPU to access the
device DRAM and the host memory.
The first 256MB of the device DRAM is being mapped. That's where the F/W
is running.
The 2MB area located on the host memory for the purpose of communication
between the driver and the device CPU is also being mapped.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch removes a limitation on the maximum packet size that is read by
the device CPU as that limitation is not needed.
Therefore, the patch also removes an elaborate calculation that is based
on this limitation which is also not needed now. Instead, use a fixed
value for the memory pool size of the packets.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch increases the timeout for PCI ELBI configuration to support low
frequency Palladium images.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch adds a new parameter that is passed to the
add_end_of_cb_packets() asic-specific function.
The parameter is the pointer to the driver's device structure. The
function needs this pointer for future ASICs.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch changes two polling functions to macros, in order to make their
API the same as the standard readl_poll_timeout so we would be able to
define the "condition for exit" when calling these macros.
This will simplify the code as it will eliminate the need to check both
for timeout and for the (cond) in the calling function.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch adds the implementation of the HL_DEBUG_OP_SET_MODE opcode in
the DEBUG IOCTL.
It forces the user who wants to debug the device to set the device into
debug mode before he can configure the debug engines. The patch also makes
sure to disable debug mode upon user releasing FD, in case the user forgot
to disable debug mode.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch fixes comments on various structure members and some spelling
errors in log messages.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch fix a potential bug where a user's process has closed
unexpectedly without disabling the debug engines. In that case, the debug
engines might continue running but because the user's MMU mappings are
going away, we will get page fault errors.
This behavior is also opposed to the general rule where nothing runs on
the device after the user process closes.
The patch stops the debug H/W engines upon process termination and thus
makes sure nothing runs on the device after the process goes away.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Where there is a spike in the CPU consumption, it may cause
random failures in the C/I since the KMD timeout for CPU
and/or QMAN0 jobs expires and it stops communicating to the simulator.
This commit fixes it by increasing timeout on polling functions
if working with simulator.
Signed-off-by: Dalit Ben Zoor <dbenzoor@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
use_virt_addr member was used for telling whether to treat the
addresses in the CB as virtual during parsing. We disabled it only
when calling the parser from the driver memset device function,
and since this call had been removed, it should always be enabled.
Signed-off-by: Dalit Ben Zoor <dbenzoor@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Routing device accesses to the host memory requires the usage of a base
offset, which is canceled by the iATU just before leaving the device.
The value of the base offset might be distinctive between different ASIC
types.
The manipulation of the addresses is currently used throughout the
driver code, and one should be aware to it whenever providing a host
memory address to the device.
This patch removes this manipulation from the driver common code, and
moves it to the ASIC specific functions that are responsible for
host memory allocation/mapping.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch renames four functions in the ASIC-specific functions section,
so it will be easier to differentiate them from the generic kernel
functions with the same name.
This will help in future code reviews, to make sure we don't use the
kernel functions directly.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The device's CPU accessible memory on host is managed in a dedicated
pool, except for 2 regions - Primary Queue (PQ) and Event Queue (EQ) -
which are allocated from generic DMA pools.
Due to address length limitations of the CPU, the addresses of all these
memory regions must have the same MSBs starting at bit 40.
This patch modifies the allocation of the PQ and EQ to be also from the
dedicated pool, to ensure compliance with the limitation.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch changes the ASIC interface function that changes the DRAM bar
window. The change is to return the old address that the DRAM bar pointed
to instead of an error code.
This simplifies the code that use this function (mainly in debugfs) to
restore the bar to the old setting.
This is also needed for easier support in future ASICs.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch only does renaming of certain variables and structure members,
and their accompanied comments.
This is done to better reflect the actions these variables and members
represent.
There is no functional change in this patch.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch slightly changes the macros of RREG32 and WREG32, which are
used when reading or writing from registers.
Instead of directly calling a function in the common code from these
macros, the new code calls a function from the ASIC functions interface.
This change allows us to share much more code between real ASICs and
simulators, which in turn reduces the maintenance burden and
the chances for forgetting to port code between the ASIC files.
The patch also implements the hl_poll_timeout macro, instead of calling
the generic readl_poll_timeout macro. This is required to allow use of
this macro in the simulator files.
As a result from this change, more functions in goya.c are shared with the
simulator and therefore, should not be defined as static.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
pr_fmt() should be defined before including linux/printk.h, either
directly or indirectly, in order to avoid redefinition of the macro.
Currently the macro definition is in habanalabs.h, which is included in
many files, and that makes the addition/reorder of includes to be prone
to compilation errors.
This patch cancels this dependency by defining the macro only in the few
source files that use it.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch removes the enum value of ASIC_AUTO_DETECT because we can use
the validity of the pdev variable to know whether we have a real device or
a simulator. For a real device, we detect the asic type from the device ID
while for a simulator, the simulator code calls create_hdev() with the
specified ASIC type.
Set ASIC_INVALID as the first option in the enum to make sure that no
other enum value will receive the value 0 (which indicates a non-existing
entry in the simulator array).
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch does some refactoring in goya.c to make code more reusable
between goya code and the goya simulator code (which is not upstreamed).
In addition, the patch removes some dead functions from goya.c which are
not used by the current upstream code
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Habanalabs ASICs use the ARM coresight infrastructure to support debug,
tracing and profiling of neural networks topologies.
Because the coresight is configured using register writes and reads, and
some of the registers hold sensitive information (e.g. the address in
the device's DRAM where the trace data is written to), the user must go
through the kernel driver to configure this mechanism.
This patch implements the common code of the IOCTL and calls the
ASIC-specific function for the actual H/W configuration.
The IOCTL supports configuration of seven coresight components:
ETR, ETF, STM, FUNNEL, BMON, SPMU and TIMESTAMP
The user specifies which component he wishes to configure and provides a
pointer to a structure (located in its process space) that contains the
relevant configuration.
The common code copies the relevant data from the user-space to kernel
space and then calls the ASIC-specific function to do the H/W
configuration.
After the configuration is done, which is usually composed
of several IOCTL calls depending on what the user wanted to trace, the
user can start executing the topology. The trace data will be written to
the user's area in the device's DRAM.
After the tracing operation is complete, and user will call the IOCTL
again to disable the tracing operation. The user also need to read
values from registers for some of the components (e.g. the size of the
trace data in the device's DRAM). In that case, the user will provide a
pointer to an "output" structure in user-space, which the IOCTL code will
fill according the to selected component.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch adds a new opcode to INFO IOCTL that returns the device status.
This will allow users to query the device status in order to avoid sending
command submissions while device is in reset.
Signed-off-by: Dalit Ben Zoor <dbenzoor@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch refactors the code that is responsible to set the DMA mask for
the device.
Upon each change of the dma mask, the driver will save the new value that
was set. This is needed in order to make sure we don't try to increase the
mask a second time, in case we failed in the first time. This is
especially relevant for Power machines, as that may cause a change in
configuration of the TVT which will break the device.
Goya will first try to set the device's dma mask to 39 bits, so that the
memory that is allocated on the host machine for communication with the
device's cpu will be in a bus address which is lower then 39 bits. Later,
Goya will try to increase that mask to 48 bits, but only if setting the
mask to 39 bits was successful.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch adds shadow mapping to the MMU module. The shadow mapping
allows traversing the page table in host memory rather reading each PTE
from the device memory.
It brings better performance and avoids reading from invalid device
address upon PCI errors.
Only at the end of map/unmap flow, writings to the device are performed in
order to sync the H/W page tables with the shadow ones.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Print the name of a busy engine when checking if a device is idle.
The change is done mainly to help a user to pinpoint problems in his
topology's recipe.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Move duplicated PCI-related code from ASIC-specific files into the common
pci.c file.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This patch moves the code that is responsible of the communication
vs. the F/W to a dedicated file. This will allow us to share the code
between different ASICs.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>