Commit Graph

9 Commits

Author SHA1 Message Date
Alexandre Ganea 8404aeb56a [Support] On Windows, ensure hardware_concurrency() extends to all CPU sockets and all NUMA groups
The goal of this patch is to maximize CPU utilization on multi-socket or high core count systems, so that parallel computations such as LLD/ThinLTO can use all hardware threads in the system. Before this patch, on Windows, a maximum of 64 hardware threads could be used at most, in some cases dispatched only on one CPU socket.

== Background ==
Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU, that is, an hyper-thread if SMT is active; a core otherwise. There's a limit of 32-bit processors on older 32-bit versions of Windows, which later was raised to 64-processors with 64-bit versions of Windows. This limit comes from the affinity mask, which historically is represented by the sizeof(void*). Consequently, the concept of "processor groups" was introduced for dealing with systems with more than 64 hyper-threads.

By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to programmatically enable it, by assigning threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries; one can only specify a "preferred" group on start-up, but the application is free to allocate more groups if it wants to.

This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P (64-cores, 128-hyperthreads) are projected by the OS as two (2) "processor groups". This means that by default, an application can only use half of the cores. This situation could only get worse in the years to come, as dies with more cores will appear on the market.

== The problem ==
The heavyweight_hardware_concurrency() API was introduced so that only *one hardware thread per core* was used. Once that API returns, that original intention is lost, only the number of threads is retained. Consider a situation, on Windows, where the system has 2 CPU sockets, 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std:🧵:hardware_concurrency() -- which can only return processors from the current "processor group".

== The changes in this patch ==
To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The number of threads to use is deferred as late as possible, until the moment where the std::threads are created (ThreadPool in the case of ThinLTO).

When using hardware_concurrency(), setting ThreadCount to 0 now means to use all the possible hardware CPU (SMT) threads. Providing a ThreadCount above to the maximum number of threads will have no effect, the maximum will be used instead.
The heavyweight_hardware_concurrency() is similar to hardware_concurrency(), except that only one thread per hardware *core* will be used.

When LLVM_ENABLE_THREADS is OFF, the threading APIs will always return 1, to ensure any caller loops will be exercised at least once.

Differential Revision: https://reviews.llvm.org/D71775
2020-02-14 10:24:22 -05:00
Michael Spencer 9ab6d8236b [clang-scan-deps] Add basic support for modules.
This fixes two issues that prevent simple uses of modules from working.

* We would previously minimize _every_ file opened by clang, even module maps
  and module pcm files. Now we only minimize files with known extensions. It
  would be better if we knew which files clang intended to open as a source
  file, but this works for now.

* We previously cached every lookup, even failed lookups. This is a problem
  because clang stats the module cache directory before building a module and
  creating that directory. If we cache that failure then the subsequent pcm
  load doesn't see the module cache and fails.

Overall this still leaves us building minmized modules on disk during scanning.
This will need to be improved eventually for performance, but this is correct,
and works for now.

Differential Revision: https://reviews.llvm.org/D68835
2019-10-24 16:19:11 -07:00
Michael J. Spencer 2f56266234 [ScanDeps] clang-format, 80 cols.
llvm-svn: 374439
2019-10-10 20:19:02 +00:00
Kousik Kumar 4abac53302 In openFileForRead don't cache erroneous entries if the error relates to them being directories. Add tests.
Summary:
It seems that when the CachingFileSystem is first given a file to open that is actually a directory, it incorrectly
caches that path to be errenous and throws an error when subsequently a directory open call is made for the same
path.
This change makes it so that we do NOT cache a path if it turns out we asked for a file when its a directory.

Reviewers: arphaman

Subscribers: dexonsmith, cfe-commits

Tags: #clang

Differential Revision: https://reviews.llvm.org/D68193

llvm-svn: 374366
2019-10-10 15:29:01 +00:00
Alex Lorenz ee30b0ecc2 [clang-scan-deps] Fix for headers having the same name as a directory
Scan deps tool crashes when called on a C++ file, containing an include
that has the same name as a directory.
The tool crashes since it finds foo/dir and tries to read that as a file and fails.

Patch by: kousikk (Kousik Kumar)

Differential Revision: https://reviews.llvm.org/D67091

llvm-svn: 371903
2019-09-13 22:12:02 +00:00
Alex Lorenz 428d92832c [clang-scan-deps] cast Result to ErrorOr<unique_ptr<vfs::File>> explicitly to avoid s390x-linux buildbot failure
llvm-svn: 371664
2019-09-11 21:00:13 +00:00
Alex Lorenz ca6e60971e [clang-scan-deps] add skip excluded conditional preprocessor block preprocessing optimization
This commit adds an optimization to clang-scan-deps and clang's preprocessor that skips excluded preprocessor
blocks by bumping the lexer pointer, and not lexing the tokens until reaching appropriate #else/#endif directive.
The skip positions and lexer offsets are computed when the file is minimized, directly from the minimized tokens.

On an 18-core iMacPro with macOS Catalina Beta I got 10-15% speed-up from this optimization when running clang-scan-deps on
the compilation database for a recent LLVM and Clang (3511 files).

Differential Revision: https://reviews.llvm.org/D67127

llvm-svn: 371656
2019-09-11 20:40:31 +00:00
Jonas Devlieghere 2b3d49b610 [Clang] Migrate llvm::make_unique to std::make_unique
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances across the monorepo.

Differential revision: https://reviews.llvm.org/D66259

llvm-svn: 368942
2019-08-14 23:04:18 +00:00
Alex Lorenz e1f4c4aad2 [clang-scan-deps] Implementation of dependency scanner over minimized sources
This commit implements the fast dependency scanning mode in clang-scan-deps: the
preprocessing is done on files that are minimized using the dependency directives source minimizer.

A shared file system cache is used to ensure that the file system requests and source minimization
is performed only once. The cache assumes that the underlying filesystem won't change during the course
of the scan (or if it will, it will not affect the output), and it can't be evicted. This means that the
service and workers can be used for a single run of a dependency scanner, and can't be reused across multiple,
incremental runs. This is something that we'll most likely support in the future though.
Note that the driver still utilizes the underlying real filesystem.

This commit is also still missing the fast skipped PP block skipping optimization that I mentioned at EuroLLVM talk.
Additionally, the file manager is still not reused by the threads as well.

Differential Revision: https://reviews.llvm.org/D63907

llvm-svn: 368086
2019-08-06 20:43:25 +00:00