This is a simple implementation which just copies data synchronously. v2: - Use size_t. v3: - Fix possible race condition by splitting the copy among multiple work items. llvm-svn: 219008