Belle II Software development
ArrayDataset Class Reference
Inheritance diagram for ArrayDataset:

Public Member Functions

 __init__ (self, array, batch_size=1024, shuffle=True, seed=None, weighted=False)
 
 __len__ (self)
 
 maybe_permuted (self, array)
 
 __iter__ (self)
 
 __getitem__ (self, slicer)
 

Static Public Member Functions

 to_tensor (array)
 

Public Attributes

 array = array
 Awkward array containing the dataset.
 
 batch_size = batch_size
 Batch size for the iterable dataset.
 
 shuffle = shuffle
 Whether to shuffle the data.
 
 seed = seed if seed is not None else np.random.SeedSequence().entropy
 Random seed for shuffling; the same seed is used by all workers.
 
 weighted = weighted
 Whether the dataset includes weights.
 

Detailed Description

Dataset initialized from a pre-processed awkward array.

Use `torch.utils.data.DataLoader` with `collate_fn=lambda x: x[0]`
and `batch_size=1` to iterate through it.

Yields a tuple of a batched dgl graph and labels. If `weighted=True`, the tuple
also contains weights, which requires a `weight` column in the array.
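
A minimal end-to-end sketch of this recipe; the parquet file name and the import path of ArrayDataset are placeholders, adapt them to your setup.

import awkward as ak
import torch

from grafei.model.dataset import ArrayDataset  # hypothetical import path

# Pre-processed awkward array with node features, a `label` column and,
# if weights are used, a `weight` column.
array = ak.from_parquet("preprocessed_events.parquet")  # placeholder file name
dataset = ArrayDataset(array, batch_size=1024, shuffle=True)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,               # the dataset already yields full batches
    collate_fn=lambda x: x[0],  # unwrap the single pre-batched item
)

for graph, labels in loader:    # with weighted=True: graph, labels, weights
    pass                        # feed the batched dgl graph and labels to the model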

Definition at line 81 of file dataset.py.

Constructor & Destructor Documentation

◆ __init__()

__init__ (self, array, batch_size = 1024, shuffle = True, seed = None, weighted = False)
Initialize the ArrayDataset for PyTorch training and inference.

:param array: Awkward array containing the dataset.
:param batch_size (int): Batch size for the iterable dataset.
:param shuffle (bool): Whether to shuffle the data.
:param seed: Random seed for shuffling.
:param weighted (bool): Whether the dataset includes weights.

Definition at line 92 of file dataset.py.

def __init__(
    self,
    array,
    batch_size=1024,
    shuffle=True,
    seed=None,
    weighted=False,
):
    """
    Initialize the ArrayDataset for PyTorch training and inference.

    :param array: Awkward array containing the dataset.
    :param batch_size (int): Batch size for the iterable dataset.
    :param shuffle (bool): Whether to shuffle the data.
    :param seed: Random seed for shuffling.
    :param weighted (bool): Whether the dataset includes weights.
    """
    self.array = array
    self.batch_size = batch_size
    self.shuffle = shuffle
    self.seed = seed if seed is not None else np.random.SeedSequence().entropy
    self.weighted = weighted
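
A hypothetical construction sketch: a weighted dataset with a fixed seed so that shuffling is reproducible across runs. It assumes `events` is a pre-processed awkward array with the node features, a `label` column and a `weight` column.

dataset = ArrayDataset(
    events,
    batch_size=512,
    shuffle=True,
    seed=42,        # fixed seed, so maybe_permuted() gives the same order every run
    weighted=True,  # batches then include a weight tensor built from the `weight` column
)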

Member Function Documentation

◆ __getitem__()

__getitem__ (self, slicer)
Get a single instance or a new ArrayDataset for a slice.

Arguments:
    slicer (int or slice): Index or slice.

Returns:
    ArrayDataset: New ArrayDataset instance.

Definition at line 190 of file dataset.py.

def __getitem__(self, slicer):
    """
    Get a single instance or a new ArrayDataset for a slice.

    Arguments:
        slicer (int or slice): Index or slice.

    Returns:
        ArrayDataset: New ArrayDataset instance.
    """
    kwargs = dict(
        batch_size=self.batch_size,
        shuffle=self.shuffle,
        seed=self.seed,
        weighted=self.weighted,
    )
    array = self.maybe_permuted(self.array)
    if not isinstance(slicer, int):
        return ArrayDataset(array[slicer], **kwargs)
    slicer = slice(slicer, slicer + 1)
    kwargs["batch_size"] = 1
    return next(iter(ArrayDataset(array[slicer], **kwargs)))
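
A short sketch of the two access modes, assuming `dataset` is an unweighted ArrayDataset as built in the first sketch above (the split point is a placeholder).

# Slicing returns a new ArrayDataset that keeps batch_size, shuffle, seed and weighted.
train = dataset[:8000]
val = dataset[8000:]

# Integer indexing builds a one-element dataset internally and materializes it;
# for an unweighted dataset the result is a (graph, labels) tuple.
graph, labels = dataset[0]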

◆ __iter__()

__iter__ ( self)
Iterate over batches with changing random seeds.

Yields:
    tuple: Batched dgl graph, labels, and optionally weights.

Definition at line 160 of file dataset.py.

def __iter__(self):
    """
    Iterate over batches with changing random seeds.

    Yields:
        tuple: Batched dgl graph, labels, and optionally weights.
    """
    worker_info = torch.utils.data.get_worker_info()
    if worker_info is not None:
        num_workers = worker_info.num_workers
        worker_id = worker_info.id
    else:
        num_workers = 1
        worker_id = 0
    array = self.maybe_permuted(self.array)
    starts = list(range(0, len(self.array), self.batch_size))
    per_worker = np.array_split(starts, num_workers)
    for start in per_worker[worker_id]:
        ak_slice = array[start: start + self.batch_size]
        output = [
            get_batched_graph(ak_slice, DEFAULT_NODE_FEATURES),
            self.to_tensor(ak_slice.label),
        ]
        if self.weighted:
            output.append(self.to_tensor(ak_slice.weight))
        yield tuple(output)
    # increase the seed to get a new order of instances in the next iteration
    # note: need to use persistent_workers=True in the DataLoader for this to take effect
    self.seed += 1
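
As the comment above notes, the per-epoch seed increment only survives if the DataLoader workers stay alive between epochs. A minimal sketch, assuming `dataset` was built with `shuffle=True`:

import torch

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    collate_fn=lambda x: x[0],
    num_workers=4,            # batch start indices are split across the workers
    persistent_workers=True,  # keep workers alive so `self.seed += 1` takes effect
)

for epoch in range(10):       # each epoch iterates the array in a new order
    for batch in loader:
        pass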

◆ __len__()

__len__ ( self)
Get the number of batches.

Returns:
    int: Number of batches.

Definition at line 120 of file dataset.py.

def __len__(self):
    """
    Get the number of batches.

    Returns:
        int: Number of batches.
    """
    return int(math.ceil(len(self.array) / self.batch_size))
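
For example, with the default `batch_size` of 1024, an array of 2500 events gives `ceil(2500 / 1024) = 3` batches.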

◆ maybe_permuted()

maybe_permuted (self, array)
Possibly permute the array based on the shuffle parameter.

Arguments:
    array (awkward array): Input array.

Returns:
    array: Permuted or original array.

Definition at line 129 of file dataset.py.

def maybe_permuted(self, array):
    """
    Possibly permute the array based on the shuffle parameter.

    Arguments:
        array (awkward array): Input array.

    Returns:
        array: Permuted or original array.
    """
    if not self.shuffle or len(self.array) == 1:
        return array
    perm = np.random.default_rng(self.seed).permutation(len(array))
    return self.array[perm]
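
A small sketch of the seed behaviour, with a toy awkward array and the hypothetical import from above: when shuffling is enabled, the permutation is fully determined by `seed`, so two datasets built with the same seed see the data in the same order.

import awkward as ak

toy = ak.Array({"x": [0, 1, 2, 3, 4]})
a = ArrayDataset(toy, shuffle=True, seed=7)
b = ArrayDataset(toy, shuffle=True, seed=7)
assert ak.to_list(a.maybe_permuted(a.array)) == ak.to_list(b.maybe_permuted(b.array))

# With shuffle=False the input array is returned unchanged.
c = ArrayDataset(toy, shuffle=False)
assert ak.to_list(c.maybe_permuted(c.array)) == ak.to_list(toy)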

◆ to_tensor()

to_tensor ( array)
static
Convert an awkward array to a torch tensor.

Arguments:
    array (awkward array): Input awkward array.

Returns:
    torch.Tensor: Converted tensor.

Definition at line 145 of file dataset.py.

@staticmethod
def to_tensor(array):
    """
    Convert an awkward array to a torch tensor.

    Arguments:
        array (awkward array): Input awkward array.

    Returns:
        torch.Tensor: Converted tensor.
    """
    return torch.tensor(
        ak.to_numpy(array, allow_missing=False),
        dtype=torch.float32,
    ).reshape(-1, 1)
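
A small sketch, reusing the imports from above: to_tensor turns numeric awkward content into a float32 column vector, the shape used for the label and weight tensors.

labels = ak.Array([0, 1, 1, 0])
t = ArrayDataset.to_tensor(labels)
print(t.shape, t.dtype)  # torch.Size([4, 1]) torch.float32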

Member Data Documentation

◆ array

array = array

Awkward array containing the dataset.

Definition at line 110 of file dataset.py.

◆ batch_size

batch_size = batch_size

Batch size for the iterable dataset.

Definition at line 112 of file dataset.py.

◆ seed

seed = seed if seed is not None else np.random.SeedSequence().entropy

Random seed for shuffling; the same seed is used by all workers.

Definition at line 116 of file dataset.py.

◆ shuffle

shuffle = shuffle

Whether to shuffle the data.

Definition at line 114 of file dataset.py.

◆ weighted

weighted = weighted

Whether the dataset includes weights.

Definition at line 118 of file dataset.py.


The documentation for this class was generated from the following file:

dataset.py