batch_generator Class Reference

Public Member Functions
	__init__ (self, parquet_path, variables, target, batch_size, chunk_size)

	__len__ (self)

	__getitem__ (self, idx)

Public Attributes
	variables = variables
	List of input variable names.

	target = target
	Name of target variable.

	batch_size = batch_size
	Number of samples per batch.

	pf = pq.ParquetFile(parquet_path)
	Open Parquet file handle, used to read metadata and row groups.

	n_chunks = self.pf.num_row_groups
	Number of chunks (Parquet row groups) in the data file.

int	current_chunk_idx = 0
	Index of the chunk (row group) that the next call to _load_chunk will read.

	dataset_length
	Total number of rows in the data file.

	chunk_lock = threading.Lock()
	Lock that guards the hand-over of a chunk between the loader thread and the main thread.

bool	chunk_ready = False
	Flag that indicates whether the new chunk has finished loading into memory.

	loader_thread = None
	Thread that loads the next chunk while the main thread is training.

	chunk_in_use = self.chunk_in_waiting
	Chunk currently in use.

	max_batches = self.max_batches_next
	Number of batches in the chunk currently in use.

int	current_batch = 0
	Index of current batch in current chunk.

tuple	chunk_in_waiting = (X, y)
	Next chunk, loaded in the background and waiting to be used.

	max_batches_next = max_batches
	Number of batches in the waiting chunk.

Protected Member Functions
	_load_chunk (self)

	_start_async_load (self)

	_wait_for_chunk (self)

	_get_batch (self, batch_idx)

Detailed Description

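Batch generator that reads the input Parquet file into memory one chunk (row group) at a time. While the current chunk is used for training, the next one is loaded in a background thread.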

Definition at line 18 of file fitter.py.

Constructor & Destructor Documentation

◆ __init__()

__init__	(	self,
		parquet_path,
		variables,
		target,
		batch_size,
		chunk_size )

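Prepare all variables and prefetch 2 chunks: the first chunk is loaded before the constructor returns, and loading of the second is started in the background. The chunk_size argument is not used in the body shown below; chunk boundaries follow the Parquet row groups.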

Definition at line 23 of file fitter.py.

    def __init__(self, parquet_path, variables, target, batch_size, chunk_size):
        """
        Prepare all variables and prefetch 2 chunks.
        """
        super().__init__(workers=1, use_multiprocessing=False, max_queue_size=10)

        self.variables = variables
        self.target = target
        self.batch_size = batch_size
        self.pf = pq.ParquetFile(parquet_path)
        self.n_chunks = self.pf.num_row_groups
        self.current_chunk_idx = 0
        self.dataset_length = sum(
            self.pf.metadata.row_group(i).num_rows for i in range(self.n_chunks)
        )

        # Multithreading
        self.chunk_lock = threading.Lock()
        self.chunk_ready = False
        self.loader_thread = None

        # Prefetch first chunk
        self._start_async_load()
        self._wait_for_chunk()  # ensure first chunk is ready
        self.chunk_in_use = self.chunk_in_waiting
        self.max_batches = self.max_batches_next

        # Prepare next chunk
        self._start_async_load()

        self.current_batch = 0
 

Member Function Documentation

◆ __getitem__()

__getitem__	(		self,
			idx )

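Returns the next batch used in training. The idx argument is ignored: batches are served sequentially from the chunk currently in use, and the waiting chunk is swapped in once the current one is exhausted.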

Definition at line 73 of file fitter.py.

    def __getitem__(self, idx):
        """
        Returns the next batch used in training
        """
        if self.current_batch >= self.max_batches:
            self._wait_for_chunk()
 
            with self.chunk_lock:
                self.chunk_in_use = self.chunk_in_waiting
                self.max_batches = self.max_batches_next
                self.current_batch = 0
 
            self._start_async_load()
 
        X, y = self._get_batch(self.current_batch)
        self.current_batch += 1
        return X, y
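
The swap performed in `__getitem__` is a double-buffering scheme: one chunk is consumed while the next is loaded. The toy sketch below mirrors that control flow with plain lists standing in for row groups. The class `DoubleBuffer` and its method names are hypothetical, and it waits with `Thread.join()` rather than the `chunk_ready` flag used in fitter.py.

```python
import threading


class DoubleBuffer:
    """Serve batches from one chunk while a thread loads the next (toy sketch)."""

    def __init__(self, chunks, batch_size):
        self.chunks = chunks          # in-memory stand-in for Parquet row groups
        self.batch_size = batch_size
        self.next_idx = 0             # index of the chunk the loader reads next
        self._start_load()            # load the first chunk ...
        self._swap()                  # ... and make it the chunk in use
        self._start_load()            # prefetch the second chunk

    def _load(self):
        self.waiting = list(self.chunks[self.next_idx])
        self.next_idx = (self.next_idx + 1) % len(self.chunks)

    def _start_load(self):
        self.loader = threading.Thread(target=self._load, daemon=True)
        self.loader.start()

    def _swap(self):
        self.loader.join()            # block until the waiting chunk is complete
        self.in_use = self.waiting
        self.max_batches = len(self.in_use) // self.batch_size
        self.current_batch = 0

    def next_batch(self):
        if self.current_batch >= self.max_batches:
            self._swap()
            self._start_load()
        i0 = self.current_batch * self.batch_size
        self.current_batch += 1
        return self.in_use[i0:i0 + self.batch_size]


buffer = DoubleBuffer([[1, 2, 3, 4], [5, 6, 7, 8]], batch_size=2)
served = [buffer.next_batch() for _ in range(5)]
print(served)
```

After the second chunk is exhausted the index wraps around, so the fifth batch comes from the first chunk again.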
 

◆ __len__()

__len__ ( self )

Returns the number of batches in the dataset, computed as the total number of rows divided by the batch size, rounded down.

Definition at line 67 of file fitter.py.

    def __len__(self):
        """
        Returns number of batches in dataset
        """
        return self.dataset_length // self.batch_size
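
Because `_load_chunk` computes `max_batches = len(df) // self.batch_size` for each chunk separately, every chunk drops its own trailing partial batch. The number of batches served in one pass over the file can therefore be smaller than the value `__len__` reports. A small arithmetic sketch, using made-up row-group sizes and hypothetical helper names:

```python
def reported_batches(rows_per_chunk, batch_size):
    """Value computed as in __len__: total rows floor-divided by batch size."""
    return sum(rows_per_chunk) // batch_size


def served_batches(rows_per_chunk, batch_size):
    """Batches available in one pass when each chunk drops its partial batch."""
    return sum(rows // batch_size for rows in rows_per_chunk)


# Hypothetical row-group sizes, chosen only for illustration.
rows_per_chunk = [1000, 1000, 500]
batch_size = 300

reported = reported_batches(rows_per_chunk, batch_size)
served = served_batches(rows_per_chunk, batch_size)
print(reported, served)
```

Since the chunk index wraps around, a caller that requests `reported` batches still receives that many; the last ones simply come from the first row group again, so an epoch need not line up with one pass over the file.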
 

◆ _get_batch()

_get_batch	(		self,
			batch_idx )

protected

Extract the batch with index batch_idx from the chunk currently in use.

Definition at line 130 of file fitter.py.

    def _get_batch(self, batch_idx):
        '''
        Extract next batch from chunk
        '''
        X, y = self.chunk_in_use
        i0 = batch_idx * self.batch_size
        i1 = i0 + self.batch_size
        return X[i0:i1], y[i0:i1]
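
The slicing arithmetic can be tried on plain lists. The sketch below uses a standalone function `get_batch` (a hypothetical name) and invented data; it shows that rows beyond `max_batches * batch_size` are never served when the caller stops at `max_batches`.

```python
def get_batch(X, y, batch_idx, batch_size):
    """Return the batch_idx-th slice of both sequences, as in _get_batch."""
    i0 = batch_idx * batch_size
    i1 = i0 + batch_size
    return X[i0:i1], y[i0:i1]


# Invented data: five rows with two features each, and one target per row.
X = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3], [0.4, 1.4]]
y = [0, 1, 0, 1, 0]
batch_size = 2
max_batches = len(X) // batch_size  # two full batches; the fifth row is left over

batches = [get_batch(X, y, i, batch_size) for i in range(max_batches)]
leftover = get_batch(X, y, max_batches, batch_size)  # short slice, not an error
print(batches, leftover)
```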
 
 

◆ _load_chunk()

_load_chunk ( self )

protected

Load the next chunk (row group) from the data file, shuffle its rows, and publish it as the waiting chunk.

Definition at line 91 of file fitter.py.

    def _load_chunk(self):
        """
        Load next chunk from datafile and shuffle it
        """
        rg = self.current_chunk_idx
        table = self.pf.read_row_group(rg)
        # Shuffle data
        df = table.to_pandas().sample(frac=1).reset_index(drop=True)
 
        X = df[self.variables].to_numpy()
        y = df[self.target].to_numpy()
        max_batches = len(df) // self.batch_size
 
        # Publish chunk
        with self.chunk_lock:
            self.chunk_in_waiting = (X, y)
            self.max_batches_next = max_batches
            self.chunk_ready = True
 
        # Move to next row group
        self.current_chunk_idx = (self.current_chunk_idx + 1) % self.n_chunks
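
In `_load_chunk` the rows are shuffled while features and target still sit in the same DataFrame, so each target stays attached to its row. The sketch below reproduces the two ideas, shuffling within a chunk and cycling through chunks, using only the standard library. The helpers `shuffle_rows` and `next_chunk_index` are hypothetical and not part of fitter.py.

```python
import random


def shuffle_rows(X, y, seed=None):
    """Shuffle features and targets with one shared permutation."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


def next_chunk_index(current, n_chunks):
    """Advance to the next row group, wrapping to 0 after the last one."""
    return (current + 1) % n_chunks


# Invented data: the target is the parity of the first feature.
X = [[i, i * 10] for i in range(6)]
y = [i % 2 for i in range(6)]

Xs, ys = shuffle_rows(X, y, seed=0)
aligned = all(target == row[0] % 2 for row, target in zip(Xs, ys))
print(aligned, next_chunk_index(2, 3))
```

Shuffling only within a chunk means rows from different row groups never end up in the same batch, which may matter if the file is sorted by some variable.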
 

◆ _start_async_load()

_start_async_load ( self )

protected

Reset the chunk_ready flag and start a new daemon thread that loads the next chunk.

Definition at line 115 of file fitter.py.

    def _start_async_load(self):
        '''
        Start new thread to load new chunk
        '''
        self.chunk_ready = False
        self.loader_thread = threading.Thread(target=self._load_chunk, daemon=True)
        self.loader_thread.start()
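
With a bare thread and a boolean flag, an exception raised inside the loader is not visible to the main thread, and the flag would stay False. One alternative, shown here only as a sketch and not as what fitter.py does, is to submit the load to a `concurrent.futures.ThreadPoolExecutor`: the returned future both waits for the result and re-raises loader errors. The function `load_chunk` and its failure condition are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(index):
    """Hypothetical loader; index 1 simulates an unreadable row group."""
    if index == 1:
        raise OSError("row group 1 could not be read")
    return [index] * 4


executor = ThreadPoolExecutor(max_workers=1)

ok_future = executor.submit(load_chunk, 0)
bad_future = executor.submit(load_chunk, 1)

chunk = ok_future.result()      # blocks until the chunk is loaded

try:
    bad_future.result()         # re-raises the loader's exception here
    error = None
except OSError as exc:
    error = str(exc)

executor.shutdown()
print(chunk, error)
```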
 

◆ _wait_for_chunk()

_wait_for_chunk ( self )

protected

Sleep until the loader thread has finished loading the next chunk. The chunk_ready flag is polled every 5 seconds.

Definition at line 123 of file fitter.py.

    def _wait_for_chunk(self):
        '''
        Sleep until second thread is finished with loading the next chunk
        '''
        while not self.chunk_ready:
            time.sleep(5)
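
Polling with a 5-second sleep means the main thread can stay idle for up to 5 seconds after the chunk is actually ready. A possible alternative, sketched here under the assumption that the boolean flag is replaced by a `threading.Event`, wakes the waiting thread as soon as the loader signals completion. The names `load_chunk` and `result` are illustrative only.

```python
import threading
import time

chunk_ready = threading.Event()
result = {}


def load_chunk():
    """Stand-in for the real loader: pretend to read data, then signal."""
    time.sleep(0.05)
    result["chunk"] = [1, 2, 3]
    chunk_ready.set()


start = time.monotonic()
threading.Thread(target=load_chunk, daemon=True).start()

# wait() returns True as soon as the event is set, or False after the timeout.
finished = chunk_ready.wait(timeout=5.0)
waited = time.monotonic() - start
print(finished, result.get("chunk"))
```

The timeout also gives the caller a chance to react if the loader never finishes, instead of looping indefinitely.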
 

Member Data Documentation

◆ batch_size

batch_size = batch_size

Number of samples per batch.

Definition at line 33 of file fitter.py.

◆ chunk_in_use

chunk_in_use = self.chunk_in_waiting

Chunk currently in use.

Definition at line 57 of file fitter.py.

◆ chunk_in_waiting

tuple chunk_in_waiting = (X, y)

Next chunk, loaded in the background and waiting to be used.

Definition at line 107 of file fitter.py.

◆ chunk_lock

chunk_lock = threading.Lock()

Lock that guards the hand-over of a chunk between the loader thread and the main thread.

Definition at line 47 of file fitter.py.

◆ chunk_ready

bool chunk_ready = False

Flag that indicates whether the new chunk has finished loading into memory.

Definition at line 49 of file fitter.py.

◆ current_batch

int current_batch = 0

Index of current batch in current chunk.

Definition at line 65 of file fitter.py.

◆ current_chunk_idx

int current_chunk_idx = 0

Index of the chunk (row group) that the next call to _load_chunk will read.

Definition at line 39 of file fitter.py.

◆ dataset_length

dataset_length

Initial value:

    = sum(self.pf.metadata.row_group(i).num_rows for i in range(self.n_chunks))

Total number of rows in the data file.

Definition at line 41 of file fitter.py.

◆ loader_thread

loader_thread = None

Thread that loads the next chunk while the main thread is training.

Definition at line 51 of file fitter.py.

◆ max_batches

max_batches = self.max_batches_next

Number of batches in the chunk currently in use.

Definition at line 59 of file fitter.py.

◆ max_batches_next

max_batches_next = max_batches

Number of batches in the waiting chunk.

Definition at line 109 of file fitter.py.

◆ n_chunks

n_chunks = self.pf.num_row_groups

Number of chunks (Parquet row groups) in the data file.

Definition at line 37 of file fitter.py.

◆ pf

pf = pq.ParquetFile(parquet_path)

Open Parquet file handle, used to read metadata and row groups.

Definition at line 35 of file fitter.py.

◆ target

target = target

Name of target variable.

Definition at line 31 of file fitter.py.

◆ variables

variables = variables

List of input variable names.

Definition at line 29 of file fitter.py.

The documentation for this class was generated from the following file:

analysis/examples/tflat/fitter.py
