
PyTorch all_gather example


PyTorch's torch.distributed package provides the collective communication primitives, including all_gather, on which multi-process and multi-GPU training is built. The package is available on Linux, macOS, and Windows, and exposes an enum-like Backend class listing the available backends: GLOO, NCCL, UCC, MPI, and other registered backends (for example, Backend("GLOO") returns "gloo"). Note that the distributed all_gather is unrelated to torch.gather, which extracts values from specified columns of a matrix; James McCaffrey's post "An Example of the PyTorch gather() Function" (January 18, 2021) covers that single-tensor function, and it is revisited at the end of this article.

Before any collective can be called, every process must initialize the distributed package with torch.distributed.init_process_group(). With the default init_method="env://", this method reads the configuration from environment variables such as MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which launchers like torch.distributed.launch set up for you; the group_name argument is deprecated and can simply be omitted. With the NCCL backend you run one process per GPU, and each process must select its own device with torch.cuda.set_device() before issuing any collective. The GPU is not chosen automatically by the distributed package — watching nvidia-smi is not enough — and adding torch.cuda.set_device(local_rank) with the launcher-provided local rank is the usual fix. init_process_group() also takes a timeout (a datetime.timedelta) that is used by the store during initialization and for methods such as get() and wait().

Several environment variables help with debugging. TORCH_DISTRIBUTED_DEBUG=INFO results in additional debug logging when models trained with torch.nn.parallel.DistributedDataParallel() are initialized, and TORCH_DISTRIBUTED_DEBUG=DETAIL can be used in conjunction with TORCH_SHOW_CPP_STACKTRACES=1 to log the entire call stack when a collective desynchronization is detected; DETAIL is the most verbose option and should only be used when debugging issues, since it can impact application performance. For NCCL topology-detection failures, NCCL_DEBUG_SUBSYS=GRAPH is helpful. Keep in mind that mismatched collective calls between processes can result in deadlocks, and that object collectives such as broadcast_object_list() use the pickle module implicitly, so they must only be called with data you trust.
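As a concrete starting point, here is a minimal initialization sketch (my own illustration, not code from the original article). It assumes the script is launched with torchrun or torch.distributed.launch, so that RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK are already present in the environment; the helper name init_distributed is just for illustration.

# Minimal initialization sketch; assumes a torchrun-style launcher has
# populated RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this process to its own GPU *before* issuing any collective;
    # the distributed package does not pick the device for you.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",                 # use "gloo" for CPU-only runs
        init_method="env://",           # read MASTER_ADDR/PORT, RANK, WORLD_SIZE
        timeout=timedelta(minutes=30),  # used by the store's get()/wait()
    )
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{local_rank}")
    dist.destroy_process_group()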
The core call is torch.distributed.all_gather(tensor_list, tensor, group=None, async_op=False): every process contributes its tensor, and after the call tensor_list on every rank holds the tensors from all ranks in rank order, so the input tensor of rank k appears at position k of every process's output list. The output list must be pre-allocated with the right number of entries, each tensor must have the same number of elements on all processes, and with the NCCL backend each tensor has to be a GPU tensor on the device the process was pinned to. If group is None, the default process group will be used. When async_op=False (the default) the call blocks until the collective has completed and returns None; it also returns None if the caller is not part of the group. With async_op=True it returns a work handle whose wait() method blocks until the operation is finished, and modifying (or reading) the tensors before the request completes causes undefined behavior, because CUDA execution is asynchronous and the collective may only have been enqueued. Complex dtypes work too: with tensors of torch.cfloat, gathering [1.+1.j, 2.+2.j] from rank 0 and [3.+3.j, 4.+4.j] from rank 1 leaves both ranks holding [tensor([1.+1.j, 2.+2.j]), tensor([3.+3.j, 4.+4.j])].

Backend choice matters. Gloo supports most operations but generally runs slower than NCCL for GPU tensors; the MPI backend is only available if PyTorch was built on a host that has MPI installed; and as of PyTorch v1.8, Windows supports all collective communication backends but NCCL. When NCCL_ASYNC_ERROR_HANDLING is set, failed NCCL collectives surface as errors instead of silent hangs. To see where communication time goes, torch.profiler (recommended, only available after 1.8.1) or torch.autograd.profiler can profile the collective and point-to-point communication APIs; the nccl, gloo, and mpi backends are supported and the communication usage is rendered in the profiling output and traces. New backends can be registered at run time by supplying a handler function that instantiates the backend.

For Python objects there are pickle-based counterparts such as all_gather_object(), gather_object(), scatter_object_list() (which fills a scatter_object_output_list), and broadcast_object_list(); with the NCCL backend the objects must be moved to the GPU device before communication takes place, and because they rely on pickle they should only be used with trusted data. Point-to-point send/recv calls additionally take an optional tag (int) to match a recv with a remote send. torch.nn.parallel.DistributedDataParallel() builds on these same primitives, and launch utilities expose the local device either as args.local_rank or as os.environ['LOCAL_RANK'].
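Putting that together, the following sketch (again my own illustration) gathers a per-rank tensor from every process; it assumes the process group from the previous snippet is already initialized and that each rank has pinned its own GPU.

import torch
import torch.distributed as dist

def gather_rank_tensors():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", torch.cuda.current_device())

    # Each rank contributes a small tensor filled with its own rank id.
    local = torch.full((4,), float(rank), device=device)

    # Pre-allocate one output slot per rank; slot k receives rank k's tensor.
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # Every rank now holds the same list, ordered by rank.
    return torch.cat(gathered)

# With world_size == 2, every rank ends up with
# tensor([0., 0., 0., 0., 1., 1., 1., 1.], device='cuda:...').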
There is also an older per-node variant, all_gather_multigpu(), whose arguments are lists of input and output GPU tensors so that a single process can drive several GPUs and exploit aggregated communication bandwidth (especially beneficial on systems with multiple InfiniBand interfaces). The downside of all_gather_multigpu is that it requires each node to have the same number of GPUs, and the indexing is easy to get wrong: each element of output_tensor_lists has size world_size * len(input_tensor_list), and bookkeeping like output_tensor_lists[i][k * world_size + j] is needed to interpret the result. For most workloads, one process per GPU with plain all_gather is simpler. In every case the result can be viewed either as the list itself or as a concatenation of the output tensors along the primary dimension, and if the output is consumed on a different CUDA stream than the one the collective ran on, explicit synchronization is required.

By default collectives operate on the default process group (also called the world); new groups can be created and passed through the group argument, and within a group each rank is a number between 0 and world_size - 1. Instead of environment variables, initialization can use an explicit init_method URL to discover peers — for example a file on a shared filesystem, init_method="file://{machine_name}/{share_folder_name}/some_file" — or a store object passed directly, in which case rank and world_size must also be specified. The key-value stores (TCPStore, FileStore, HashStore, and PrefixStore, which adds a prefix to each key inserted to the store) expose set(), get(), add(), wait(), delete_key(), and compare_set(): set() inserts a key-value pair; the first call to add() for a given key creates a counter, and subsequent calls with the same key increment the counter by the specified amount; wait() blocks until each key in keys has been added to the store, throwing an exception if the timeout passes first; delete_key() returns True if the key was deleted and False otherwise; and compare_set() only writes desired_value if the current value matches expected_value (or if expected_value is an empty string and the key does not yet exist). Any of the store methods can be used from either the client or the server after initialization. On the NCCL side, the backend performs automatic tuning based on its topology detection to save users tuning effort, NCCL_BLOCKING_WAIT sets the duration for which the process will block and wait for a collective to complete before an exception is thrown, and the NCCL documentation has the full list of NCCL environment variables.
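For completeness, here is a small TCPStore sketch showing the store methods described above; the hostname, port, and world size are placeholder assumptions, not values from the article.

from datetime import timedelta
import torch.distributed as dist

# Rank 0 owns the server; every other rank connects as a client.
is_server = True  # derive from your own rank logic
store = dist.TCPStore(
    "127.0.0.1", 29500,            # placeholder host/port
    world_size=2,
    is_master=is_server,
    timeout=timedelta(seconds=30),
    wait_for_workers=False,        # don't block the server until clients join
)

store.set("status", "ready")       # plain key-value write
store.add("counter", 1)            # first add() creates the counter at 1
store.wait(["status"])             # blocks until the key exists (or timeout)
print(store.get("status"))         # b'ready'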
all_gather is one of six collective strategies in torch.distributed — reduce, all_reduce, scatter, gather, all_gather, and broadcast — and it helps to see where it sits among them. broadcast sends a tensor from the src rank to the whole group (tensor is the data to be sent if src is the rank of the current process, and the buffer to receive into otherwise). scatter hands each process exactly one tensor, taken from a scatter_list that only the source rank provides (scatter_list defaults to None elsewhere); like its siblings it has an object version, so Python objects can be passed in. gather is the mirror image: it collects a list of tensors onto a single dst process, and gather_list must be None on the non-destination ranks. reduce and all_reduce combine tensors element-wise with a reduction op, leaving the result on one rank or on all ranks, and reduce_scatter reduces an input tensor and then scatters the pieces to all ranks in the group. gather_object() uses pickle implicitly, with the same trust caveat as before, and all of these work with integer tensors as well (for example tensors of torch.int64 type).

Every collective operation function supports two kinds of operation. Synchronous calls block the process until the collective has completed — for CUDA tensors that means execution on the device, not just enqueued — and the whole group must make the same call; tensor sizes and list lengths must be identical among all the ranks, otherwise the behavior is undefined and processes can end up desynchronized or deadlocked. Asynchronous calls (async_op=True) return a work handle, and wait() will block the process until the operation is finished. The package ships with built-in gloo, nccl, and mpi backends (plus ucc), and with NCCL each process should be operating on a single GPU. The existence of the TORCHELASTIC_RUN_ID environment variable indicates that the process was launched by torchelastic. All of this plumbing buys communication bandwidth across machines, but it comes with boilerplate: the official PyTorch ImageNet example implements multi-node training, and roughly a quarter of its code is engineering for multi-GPU support — setting CUDA devices and flags, parsing environment variables and CLI arguments, wrapping the model in DDP, configuring distributed samplers, and moving data to the right device.
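To make the gather/all_gather contrast concrete, this sketch (my own, not from the article) collects every rank's tensor onto rank 0 only. It assumes an initialized process group; depending on your PyTorch/NCCL version, gather may need the gloo backend.

import torch
import torch.distributed as dist

def gather_to_rank0(local: torch.Tensor, dst: int = 0):
    world_size = dist.get_world_size()
    if dist.get_rank() == dst:
        # Only the destination rank allocates the output list.
        gather_list = [torch.empty_like(local) for _ in range(world_size)]
    else:
        gather_list = None  # must be None on non-destination ranks
    dist.gather(local, gather_list=gather_list, dst=dst)
    return gather_list      # a list of tensors on rank 0, None elsewhere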
A few practical notes. Use gloo unless you have specific reasons to use MPI; if no backend is specified, both gloo and nccl backends will be created, with gloo used for collectives on CPU tensors and nccl for CUDA tensors. A file-based init_method or a FileStore must live on a path visible from all machines (local filesystems and NFS both work), and the TCPStore constructor takes a wait_for_workers flag that controls whether the server blocks until all the workers have connected. Process-group options (pg_options) tune backend behaviour — for NCCL, is_high_priority_stream can be specified so that the backend picks high-priority CUDA streams — and third-party backends can plug in through a run-time register mechanism (torch.distributed.Backend.register_backend()), which is also how a process-group extension reports its backend name. When wrapping a model in DistributedDataParallel, device_ids=[args.local_rank] is valid only for the NCCL backend. Using multiple process groups with the NCCL backend concurrently requires care: collectives on them must be issued in a consistent order and synchronized, or they can deadlock. Reductions take a ReduceOp, which specifies an operation used for element-wise reductions (SUM, PRODUCT, MIN, MAX, ...); the older enum-like reduce_op class is deprecated. Failures in the distributed package surface as exceptions that can be caught and handled, with backend-specific errors raising their own exception type.

For diagnosing hangs there is torch.distributed.monitored_barrier(), which synchronizes all processes similar to torch.distributed.barrier() but takes a timeout and can report which ranks failed to join. It is implemented with send/recv communication primitives used as acknowledgements: non-zero ranks block until rank 0 has processed their acknowledgement, and rank 0 will throw on the first failed rank it encounters in order to fail fast, reporting which rank(s) did not acknowledge the barrier within the timeout. As an example, if rank 1 never calls into torch.distributed.monitored_barrier() (in practice this could be due to an application bug or a hang in a previous collective), rank 0 raises after the timeout with a message naming rank 1. In case of NCCL failure you can also set NCCL_DEBUG=INFO to print explicit diagnostics. Finally, remember that all the object-based collectives will execute arbitrary code during unpickling, so never feed them untrusted input.
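The object flavour looks like this (illustrative sketch; the metric dictionary is a made-up example): every rank contributes a Python object and all ranks receive the full, rank-ordered list.

import torch.distributed as dist

def gather_metrics(local_metrics: dict):
    world_size = dist.get_world_size()
    gathered = [None] * world_size           # one placeholder per rank
    # Uses pickle under the hood -- trusted data only; with the NCCL backend
    # the pickled tensors are moved to the current GPU device.
    dist.all_gather_object(gathered, local_metrics)
    return gathered                          # identical list on every rank

# e.g. rank 0 passes {"loss": 0.71} and rank 1 passes {"loss": 0.66};
# both ranks get [{"loss": 0.71}, {"loss": 0.66}] back.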
The debug flags mentioned earlier become most useful with DistributedDataParallel. TORCH_DISTRIBUTED_DEBUG=INFO enhances crash logging in torch.nn.parallel.DistributedDataParallel() when the failure is due to unused parameters in the model: on a crash the user is passed information about the parameters which went unused, which may be challenging to find manually in a large model, and those parameters must either be used in the loss computation or be handled explicitly (for example via find_unused_parameters), since DDP does not support unused parameters in the backwards pass by default. TORCH_DISTRIBUTED_DEBUG=DETAIL additionally triggers consistency and synchronization checks on every collective call issued by the user from all ranks, so a function that feeds mismatched input shapes into a collective, or calls collectives in a different order on different ranks, is reported instead of hanging; the corresponding logs are rendered both at initialization time and at run time. Async work handles also expose get_future(), which returns a torch._C.Future object, and batched point-to-point exchanges are described with the P2POp class, which bundles the type of P2P operation, the communication buffer, and the peer rank (all ranks passed to dist.P2POp must participate in the batch). The same collective ideas exist outside PyTorch too: all_reduce corresponds to MPI_Allreduce, and reduce to MPI_Reduce.

Do not confuse the distributed all_gather with torch.gather (torch.Tensor.gather), which is a multi-index selection method on a single tensor: along dim=1, out[i][j] = input[i][index[i][j]]. A straightforward example creates an output tensor by gathering the elements at the 8th, 4th, and 2nd indices of a 1-D input tensor; with a matrix, calling gather with dimension 1 and index values 0 and 1 picks those columns from each row. As a closing use case for the distributed version, a common workflow is to first build a single-node, single-GPU evaluation of a pre-trained ResNet-18 and use its accuracy as the reference, then shard the validation set across ranks and all_gather (or all_reduce) the per-rank predictions or counts, checking that the distributed run reproduces the same number.
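Here is a small torch.gather sketch reproducing the two single-tensor cases just described (the concrete input values are my own, chosen only to make the indexing visible):

import torch

# 1-D case: gather the elements at indices 8, 4 and 2.
x = torch.arange(10, 20)              # tensor([10, 11, ..., 19])
idx = torch.tensor([8, 4, 2])
print(torch.gather(x, 0, idx))        # tensor([18, 14, 12])

# 2-D case: gather along dim=1 with index values 0 and 1,
# i.e. out[i][j] = m[i][index[i][j]].
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
index = torch.tensor([[0, 1],
                      [1, 0]])
print(torch.gather(m, 1, index))      # tensor([[1., 2.], [5., 4.]])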