Commit message | Author | Age | Files | Lines
...
* backend: refine the local copy propagation. | rander.wang | 2017-06-16 | 1 | -0/+34
  The src modifier is not supported by some instructions, so return false when it is present. This fixes the failing piglit scalar-arithmetic-int test.
  V2: (1) add hadd and rhadd; (2) confirmed that math functions support the modifier, except IDIV/Mod.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Utest: Add test case for cl_intel_required_subgroup_size extension | Pan Xiuli | 2017-06-16 | 5 | -0/+75
  Check the subgroup sizes supported by the device, and use intel_reqd_sub_group_size to build kernels at each of these sizes. Then check whether each kernel spills.
  V2: Fix memory leak.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Runtime: Add new API enums for cl_intel_required_subgroup_size extension | Pan Xiuli | 2017-06-16 | 7 | -0/+50
  Add CL_DEVICE_SUB_GROUP_SIZES_INTEL for clGetDeviceInfo, CL_KERNEL_SPILL_MEM_SIZE_INTEL for clGetKernelWorkGroupInfo, and CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL for clGetKernelSubGroupInfo. The extension is only enabled with LLVM 40+, which is needed for frontend support.
  V2: Add opencl-c define.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Backend: Add intel_reqd_sub_group_size support | Pan Xiuli | 2017-06-16 | 3 | -13/+45
  If the frontend provides an intel_reqd_sub_group_size attribute, pass it to the backend.
  V2: Refine codeGenNum with a runtime calculation, and fail the build if the size from the frontend is illegal.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* do constant folding for kernel struct args | Guo Yejun | 2017-06-16 | 6 | -0/+213
  In the following GEN IR, %41 is a kernel argument (struct). The first LOAD becomes a mov, and the second LOAD becomes an indirect mov (see lowerFunctionArguments). This hurts performance, and even affects the correctness of register liveness for the indirect mov.
      LOADI.uint64 %1114 72
      ADD.int64 %78 %41 %1114
      LOAD.int64.private.aligned {%79} %78 bti:255
      LOADI.int64 %1115 8
      ADD.int64 %1116 %78 %1115
      LOAD.int64.private.aligned {%80} %1116 bti:255
  This function folds the constants 72 and 8 together, so the second LOAD becomes a direct mov. The GEN IR then looks like:
      LOADI.int64 %1115 80
      ADD.int64 %1116 %41 %1115
  Signed-off-by: Guo Yejun <yejun.guo@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
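The folding described in this commit can be illustrated with a minimal C sketch. It is not the Beignet implementation: `addr_expr` and `fold_offset` are hypothetical names, and the model assumes an address is always a base register plus one constant byte offset.

```c
#include <stdint.h>

/* Hypothetical, simplified model of "base register + constant offset". */
typedef struct {
    int     base_reg;  /* register holding the base, e.g. 41 for %41 */
    int64_t offset;    /* constant byte offset added to the base */
} addr_expr;

/* Fold a second constant into an address that is already base + constant.
 * (%41 + 72) + 8 becomes %41 + 80, so the result is again expressed
 * directly in terms of the kernel argument register. */
static addr_expr fold_offset(addr_expr inner, int64_t extra)
{
    addr_expr folded = { inner.base_reg, inner.offset + extra };
    return folded;
}
```

With the values from the commit message, folding 8 into `{41, 72}` yields `{41, 80}`, which matches the rewritten `LOADI.int64 %1115 80` / `ADD.int64 %1116 %41 %1115` pair.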
* Use aligned16 and aligned4 kernels to copy large 3D images with TILE_Y. | Yan Wang | 2017-06-14 | 7 | -37/+149
  This is similar to the 2D image case, and avoids truncating the extended image width.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add test case for large 3D image with TILE_Y. | Yan Wang | 2017-06-14 | 1 | -0/+98
  It tests the aligned4 and aligned16 kernels for 3D images.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Optimize clEnqueueWriteImageByKernel and clEnqueuReadImageByKernel. | Yan Wang | 2017-06-13 | 1 | -7/+18
  1. Only copy the data defined by origin and region.
  2. Add clFinish to guarantee that the kernel copy has finished for blocking writes.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Fix bug of clEnqueueUnmapMemObjectForKernel and clEnqueueMapImageByKernel. | Yan Wang | 2017-06-13 | 1 | -34/+113
  1. Support writing data in mapping/unmapping mode.
  2. Add mapping record logic.
  3. Add clFinish to guarantee that the kernel copy has finished.
  4. Fix the call to clEnqueueMapImageByKernel: the blocking_map and map_flags arguments need to be swapped.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add clFinish to guarantee that the kernel copy has finished when creating a TILE_Y large image. | Yan Wang | 2017-06-13 | 1 | -0/+7
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add cl_mem_record_map_mem_for_kernel() to record the map address for a TILE_Y image mapped by kernel copying. | Yan Wang | 2017-06-13 | 2 | -26/+88
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add utest to test writing data into large image (TILE_Y) by map/unmap and USE_HOST_PTR mode. | Yan Wang | 2017-06-13 | 1 | -0/+115
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add utest to test writing data into large image (TILE_Y) by map/unmap mode. | Yan Wang | 2017-06-13 | 1 | -0/+198
  It is used to reproduce the clCopyImage/clFillImage bug seen in the conformance test.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add utest case for filling image by small region. | Yan Wang | 2017-06-13 | 1 | -0/+50
  It is used to reproduce the allocations bug seen in the conformance test.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* utests: added for optimization negativeAdd | rander | 2017-06-09 | 3 | -1/+46
  A negative Add looks like:
      exp -a
  LLVM transforms it to:
      add x, -a, 0
      exp x
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
* Backend: Add optimization for negative modifier | rander | 2017-06-09 | 1 | -4/+28
  LLVM transforms Mad(a, -b, c) into:
      Add b, -b, 0
      Mad val, a, b, c
  pow(a, -b) and other builtin math functions are lowered to the same kind of instruction sequence. Gen supports the negative modifier, so mad(a, -b, c) is natively supported. Treat the Add like "mov b, -b", that is, as a Mov operation handled in the same way as LocalCopyPropagation.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
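The idea of folding the negating Add into a source modifier can be sketched in C as follows. This is only an illustration under stated assumptions: `src_operand` and `propagate_negation` are hypothetical names, and the sketch ignores the case, mentioned in a later commit, where an instruction does not support source modifiers at all.

```c
#include <stdbool.h>

/* Hypothetical source operand: a register plus a "negate" source modifier. */
typedef struct {
    int  reg;
    bool negate;
} src_operand;

/* Given a definition "def_dst = (-)def_src" (the Add with 0, treated as a
 * mov), rewrite a use of def_dst so that it reads def_src directly.
 * Negations combine by XOR: -(-x) is x. Uses of other registers are
 * returned unchanged. */
static src_operand propagate_negation(src_operand use, int def_dst,
                                      src_operand def_src)
{
    if (use.reg != def_dst)
        return use;
    src_operand out = { def_src.reg, use.negate != def_src.negate };
    return out;
}
```

For example, with a definition `r10 = -r5`, a use of `r10` becomes a use of `-r5`, so the instruction can consume the negated source without the extra Add.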
* utests: add utest for sqrt-div optimization | rander | 2017-06-09 | 3 | -1/+71
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* backend: add sqrt-div pattern to instruction select | rander | 2017-06-09 | 1 | -0/+69
  There are some patterns like:
      sqrt r1, r2;
      load r4, 1.0;     ===>  rsqrt r3, r2
      div r3, r4, r1;
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
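A peephole match for this pattern might look like the C sketch below. It is an illustration only, not the Beignet instruction selector: the `insn` struct, the opcode names, and `fold_sqrt_div` are hypothetical, and the sketch only rewrites the div, leaving the sqrt and load in place for a later dead-code pass to remove if they have no other users.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction model. */
typedef enum { OP_SQRT, OP_LOADI, OP_DIV, OP_RSQ } opcode;

typedef struct {
    opcode op;
    int    dst, src0, src1;  /* virtual register ids, -1 when unused */
    float  imm;              /* immediate value, used by OP_LOADI only */
} insn;

/* Try to match, starting at index i:
 *     sqrt  r1, r2
 *     loadi r4, 1.0
 *     div   r3, r4, r1
 * and rewrite the div into "rsq r3, r2". Returns true when rewritten. */
static bool fold_sqrt_div(insn *code, int n, int i)
{
    if (i < 0 || i + 2 >= n)
        return false;
    const insn *s = &code[i];
    const insn *l = &code[i + 1];
    insn *d = &code[i + 2];

    if (s->op != OP_SQRT || l->op != OP_LOADI || d->op != OP_DIV)
        return false;
    if (l->imm != 1.0f)
        return false;
    /* The div must compute (loaded 1.0) / (sqrt result). */
    if (d->src0 != l->dst || d->src1 != s->dst)
        return false;
    /* The sqrt input must still hold its value when the div executes. */
    if (s->dst == s->src0 || l->dst == s->src0 || l->dst == s->dst)
        return false;

    d->op = OP_RSQ;
    d->src0 = s->src0;
    d->src1 = -1;
    return true;
}
```

Note that a hardware reciprocal square root may round differently from a separate sqrt followed by a divide, so a real pass would also have to consider the precision requirements of the function being compiled.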
* Runtime: Fix a missing LLVM version macro for LLVM40+ | Pan Xiuli | 2017-06-09 | 1 | -1/+1
  Found a missing macro that needs to change to support LLVM40+.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* keep GEN IR as SSA style | Guo Yejun | 2017-06-09 | 1 | -3/+5
  Signed-off-by: Guo Yejun <yejun.guo@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine exp function with float input | rander | 2017-06-09 | 1 | -2/+56
  Remove some corner-case checks, because these paths cannot be reached, and convert branch code to select. These improvements give a 20% performance gain, and the performance of OCL_ExpFixture_Exp in opencv can now match the other Gen driver.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
* backend: refine hypot function | rander | 2017-06-09 | 1 | -14/+60
  The OCL_Magnitude test of opencv is slow on beignet because of hypot. Refine hypot by changing the algorithm and removing unnecessary code, for a 30% speedup.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
* Fix bug of clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer. | Yan Wang | 2017-05-25 | 5 | -28/+89
  The "imagedim_non_pow_2" cases of the basic module of the conformance test show a regression after the previous patch started using TILE_Y mode for large images. The bug comes from the non-align16 kernel of clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer. It forces the CL_RGBA/CL_UNORM_INT8/8191x8192 image of the conformance test into a CL_R/CL_UNSIGNED_INT8/32764x8192 image for copying. The width becomes 8191 x 4 = 32764, which exceeds the maximum width (16 x 1024 = 16384) of the GEN surface state structure, which only has 14 bits for it. So use the align4 copy kernel to avoid this bug.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
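The width arithmetic in this commit can be checked with a short C sketch. The helper names are hypothetical, and the sketch assumes (it is not stated in the commit) that the align4 kernel copies 4 bytes per element, so an RGBA8 image keeps its original width in elements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Width field size taken from the commit message: 14 bits. */
enum { GEN_SURFACE_WIDTH_BITS = 14 };

/* Maximum width, 2^14 = 16 x 1024 = 16384. */
static uint32_t max_surface_width(void)
{
    return 1u << GEN_SURFACE_WIDTH_BITS;
}

/* Width of a row after reinterpreting the image so that each element is
 * bytes_per_copy_elem bytes wide, and whether that width still fits.
 * Returns false when the row does not divide evenly into copy elements. */
static bool reinterpreted_width_fits(uint32_t width_px, uint32_t bytes_per_px,
                                     uint32_t bytes_per_copy_elem)
{
    uint32_t row_bytes = width_px * bytes_per_px;
    if (bytes_per_copy_elem == 0 || row_bytes % bytes_per_copy_elem != 0)
        return false;
    return row_bytes / bytes_per_copy_elem <= max_surface_width();
}
```

For the 8191-pixel-wide RGBA8 image, copying byte by byte gives 8191 x 4 = 32764 elements per row, which does not fit, while copying 4 bytes at a time keeps the width at 8191, which does.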
* Add utest to reproduce the bug of imagedim_non_pow_2 cases of conformance test. | Yan Wang | 2017-05-25 | 1 | -0/+46
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
* build: fix cmake code generation dependencies. | Ismo Puustinen | 2017-05-25 | 1 | -2/+2
  There is a race condition between building .bc and header files and generating code from .cl targets. Fix the race by adding the dependency to generated files.
  Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
* refresh DAG when an arg has both direct and indirect read | Guo Yejun | 2017-05-23 | 1 | -1/+16
  When the return value is ARG_INDIRECT_READ, it is still possible that some IRs read the argument directly. Those reads are handled in buildConstantPush(), so we need to refresh the DAG after buildConstantPush().
  Another method is to update the DAG accordingly, but I don't think that is easy compared with the refresh method, so I did not choose it.
  Signed-off-by: Guo Yejun <yejun.guo@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Backend: Add sel ir output for MATH function | Pan Xiuli | 2017-05-23 | 1 | -0/+42
  Previously the output only said MATH; now it shows which math function it is.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* backend: fix tgamma error after restructure | rander | 2017-05-23 | 1 | -25/+31
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
* Implement TILE_Y large image in clEnqueueWriteImage. | Yan Wang | 2017-05-18 | 1 | -0/+46
  Copying data from a host ptr to a TILE_Y large image by memcpy fails. Use clEnqueueCopyBufferToImage to do this on the GPU side.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Implement TILE_Y large image in clEnqueueReadImage. | Yan Wang | 2017-05-18 | 1 | -0/+55
  Copying data from a TILE_Y large image to a buffer by memcpy fails. Use clEnqueueCopyImageToBuffer to do this on the GPU side.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Implement TILE_Y large image in clEnqueueMapImage and clEnqueueUnmapMemObject. | Yan Wang | 2017-05-18 | 1 | -0/+111
  Copying data from a TILE_Y large image to a buffer by memcpy fails. Use clEnqueueCopyImageToBuffer to do this.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Still create images in TILE_Y mode when image size > 128MB, for performance. | Yan Wang | 2017-05-18 | 4 | -6/+111
  Copying data from a host ptr to a TILE_Y large image may fail, so use clCopyBufferToImage to do this on the GPU side.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add image use_hostptr benchmark case for testing large image operations. | Yan Wang | 2017-05-18 | 2 | -0/+85
  It is for testing large image with TILE_Y mode.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add image use_hostptr case for testing large image operations. | Yan Wang | 2017-05-18 | 2 | -0/+76
  It is for testing large image with TILE_Y mode.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add image filling case for testing large image operations. | Yan Wang | 2017-05-18 | 2 | -0/+121
  It is for testing large image with TILE_Y mode.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Add image copying case for testing large image operations. | Yan Wang | 2017-05-18 | 2 | -0/+122
  It is for testing large image with TILE_Y mode.
  Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Backend: Fix performance regression with sampler refine for LLVM40 | Pan Xiuli | 2017-05-18 | 2 | -9/+41
  After the refine, we cannot know whether a sampler is constant-initialized. This breaks the compiler optimization for constant samplers, and the SAMPLE instruction to use is then decided at runtime. Now fix the sampler refine for LLVM40 to enable the constant check.
  V2: Fix a typo in the type of function __gen_ocl_sampler_to_int.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Yang Rong <rong.r.yang@intel.com>
* Backend: Fix llvm40 assert about literal structs | Pan Xiuli | 2017-05-18 | 1 | -1/+2
  In LLVM, literal structs have no name, so check for that first.
  Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
  Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
* backend: refine asin function | rander.wang | 2017-05-17 | 1 | -21/+7
  Refine the algorithm to remove unnecessary operations.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine atan | rander.wang | 2017-05-17 | 1 | -53/+58
  Remove the private array and convert if statements to select.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine acos | rander.wang | 2017-05-17 | 1 | -4/+9
  Refine the algorithm to remove a branch.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine sincos | rander.wang | 2017-05-17 | 2 | -13/+277
  Remove redundant operations to improve performance.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine tan function | rander.wang | 2017-05-17 | 1 | -16/+45
  Take the implementation from crlibm and refine it for Gen.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine cos function | rander.wang | 2017-05-17 | 1 | -26/+25
  Do it the same way as the sin function.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine sin function | rander.wang | 2017-05-17 | 1 | -20/+22
  (1) Refine the NAN check.
  (2) Use sqrt to get cos.
  (3) Remove the small range check.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine the argument reduction | rander.wang | 2017-05-17 | 1 | -24/+14
  Use a simple algorithm to compute it.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine pow function | rander.wang | 2017-05-17 | 1 | -134/+141
  Remove the private array and some unnecessary if checks, and convert some if statements to select. This improves performance by about 50%.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* backend: refine the code structure of math | rander.wang | 2017-05-17 | 9 | -7538/+4073
  Move all the common math functions to match_common.cl, which makes them easier to maintain.
  Signed-off-by: rander.wang <rander.wang@intel.com>
  Tested-by: Yang Rong <rong.r.yang@intel.com>
* GLK: add geminilake runtime support. | Yang Rong | 2017-05-15 | 2 | -2/+47
  Geminilake is almost the same as bxt, except for the intel_gpgpu_read_ts_reg function.
  Signed-off-by: Yang Rong <rong.r.yang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
* GLK: add geminilake backend support. | Yang Rong | 2017-05-15 | 5 | -2/+47
  Geminilake's backend is the same as bxt.
  Signed-off-by: Yang Rong <rong.r.yang@intel.com>
  Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>