Exploring the Pitfalls of Using a Single dict as a Function Parameter in Python, and How to Fix Them

聞浩凱 Hao-Kai Wen
10 min read · Feb 22, 2024


In Python development, it is common to use a single dictionary as a function's parameter. While this approach makes development very fast, it can also hurt maintainability. This article explores the potential pitfalls of using a single dict as a function parameter and offers some fixes that improve code quality.

What Makes Code Good?

Good code has at least these qualities: understandable naming, understandable behavior, ease of extension, ease of maintenance, and appropriate comments and tests. These qualities are subjective, and judging them takes experience.

In a software company, a piece of code repeatedly goes through the following cycle: a developer adds or changes code, and another person reviews it to confirm the functionality is implemented correctly. The reviewer is not necessarily familiar with the entire codebase. In terms of outcomes, good code keeps both the developer's time to add a feature and the reviewer's time to review it short; this is an important and fairly reliable yardstick for judging code quality.

Another clue is whether, when a bug shows up, it is easy to add a new test without that one test having to drag in half the codebase. The amount of code that must change when requirements change is another useful measure.

Using a Single dict as a Function Parameter

def f1(data: dict):
    print(f"data['bar']: {data['bar']}.")
    print(f"data['baz']: {data['baz']}.")

def f2(bar, baz):
    print(f"bar: {bar}.")
    print(f"baz: {baz}.")

data = {'bar': 'bar', 'baz': 1}

f1(data)
f2('bar', 1)

Using f1 as a stand-in for the f2 style is common in Python. A likely reason is that Python is dynamically typed, so values of different types can be dropped into a dict effortlessly, and dict is a built-in type that people pick up when they first learn the language. By contrast, rewriting f1 in C++ is quite painful:

#include <map>
#include <string>
#include <variant>
#include <cstdio>

void f1(const std::map<std::string, std::variant<std::string, int>>& data) {
    printf("data['bar']: %s.\n", std::get<std::string>(data.at("bar")).c_str());
    printf("data['baz']: %d.\n", std::get<int>(data.at("baz")));
}

void f2(const std::string& bar, int baz) {
    printf("bar: %s.\n", bar.c_str());
    printf("baz: %d.\n", baz);
}

int main() {
    std::map<std::string, std::variant<std::string, int>> data = {
        {"bar", std::string("bar")},
        {"baz", 1}
    };

    f1(data);
    f2("bar", 1);

    return 0;
}

In C++, f1 is harder to write than f2, and `data` is harder to prepare. So does the advantage Python's dict is born with bring any side effects? Using machine learning's flood of hyperparameters and data processing pipelines as examples, the rest of this article describes some hair-raising patterns that can cost a team months.

Machine Learning: Handling Large Numbers of Hyperparameters Properly

Machine learning code usually involves a large number of hyperparameters. Because writing every parameter out in every function takes time, most people wrap them all in a dict, as in this example:

class SimpleTrainer:
    def __init__(self, config: dict):
        self.config = config
        # omitted

    def fit(self):
        # ...
        for epoch_idx in range(self.config['max_epochs']):
            # ...
            save_model(self.config['ckpt_path'])
            # ...


config = load_config(config_path)
trainer = SimpleTrainer(config)

config['model'] = load_model(model_path)
config['train_dataloaders'] = get_train_dataloaders()
config['val_dataloaders'] = get_val_dataloaders()
config['ckpt_path'] = ckpt_path

trainer.fit()

SimpleTrainer is a common pattern. It looks clean and concise at first glance, but it causes three problems.

The first problem: if config is missing a parameter, the error surfaces late. For example, if config lacks the checkpoint path ckpt_path, the program may run for hours and only fail when it starts saving the model. If the missing parameter were caught when the function was called, those hours would not be wasted.

The second problem: extra parameters in config usually cause no logic errors, but they do cause maintenance problems. Perhaps a parameter was once used inside fit and later deleted when it was no longer needed; since extra keys in config never raise an error, never-used parameters gradually pile up and puzzle reviewers.

The third problem: fit takes no explicit parameters; everything is passed implicitly through config. Anyone other than the original developer has to read the whole fit function to discover that four more parameters must be put into config before it works.
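To make the first and third problems concrete, here is a minimal runnable sketch (save_model is a stand-in, and the empty loop stands in for hours of training); the missing key only surfaces after the whole loop has already run:

def save_model(ckpt_path):
    print(f"saving to {ckpt_path}")

class SimpleTrainer:
    def __init__(self, config: dict):
        self.config = config

    def fit(self):
        for epoch_idx in range(self.config['max_epochs']):
            pass  # hours of training happen here
        save_model(self.config['ckpt_path'])  # the missing key surfaces only now

trainer = SimpleTrainer({'max_epochs': 3})  # 'ckpt_path' forgotten
trainer.fit()  # raises KeyError('ckpt_path') only after the loop has finished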

The more maintainable approach is actually to write the many hyperparameters out, and to pass the appropriate parameters into the fit method:

class Trainer:
    def __init__(
        self,
        accelerator="auto",
        strategy="auto",
        devices="auto",
        max_epochs=None,
        min_epochs=None,
        max_steps=-1,
        min_steps=None,
        max_time=None,
        limit_train_batches=1.0,
        limit_val_batches=1.0,
        limit_test_batches=1.0,
        precision=None,
        logger=None,
        callbacks=None,
        fast_dev_run=False,
    ):
        pass

    def fit(
        self,
        model,
        train_dataloaders=None,
        val_dataloaders=None,
        datamodule=None,
        ckpt_path=None,
    ):
        pass

Some argue that this forces you to write many parameters when constructing a Trainer, and that every change to the __init__ definition forces every caller to change as well, hurting maintainability, as in the following:

config = load_config(config_path)
trainer = Trainer(
    accelerator=config['accelerator'],
    strategy=config['strategy'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    min_epochs=config['min_epochs'],
    max_steps=config['max_steps'],
    min_steps=config['min_steps'],
    max_time=config['max_time'],
    limit_train_batches=config['limit_train_batches'],
    limit_val_batches=config['limit_val_batches'],
    limit_test_batches=config['limit_test_batches'],
    precision=config['precision'],
    logger=config['logger'],
    callbacks=config['callbacks'],
    fast_dev_run=config['fast_dev_run'],
)

But with ** dictionary unpacking, it becomes very simple:

config = load_config(config_path)
trainer = Trainer(**config)

The code is no longer than the SimpleTrainer version at all, and when config carries an extra parameter or lacks a required one, Trainer fails immediately.
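A quick sketch of this fail-fast behavior, using a trimmed-down two-parameter Trainer: a misspelled (or otherwise unknown) key in config raises a TypeError at the call site instead of hours into a run.

class Trainer:
    def __init__(self, max_epochs=None, ckpt_path=None):
        self.max_epochs = max_epochs
        self.ckpt_path = ckpt_path

config = {'max_epochs': 3, 'ckpt_paht': 'model.ckpt'}  # note the typo in the key
trainer = Trainer(**config)
# TypeError: __init__() got an unexpected keyword argument 'ckpt_paht'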

Academic machine-learning repositories on GitHub very often use the SimpleTrainer pattern, and because it is so common, most people see nothing wrong with it. Research code tends to be written by one person and to have a short life, so there it is not a big problem. In a multi-developer environment with a longer lifecycle, though, the wasted time can reach weeks. Python lets you develop fast, but development speed has to be balanced against maintainability. My advice is to switch to the Trainer style before the first code review; once you are fluent, you can simply write it that way from the start.

Data Processing Pipelines: Avoiding dict Chaos

In data processing pipelines, using a single dict as the input parameter can lead to chaos, and it tends to creep in gradually and unconsciously. Take cake making as an example:

def make_cake(flavor, with_whipping_cream):
    # Gather the required ingredients
    ingredients = get_ingredients(flavor, with_whipping_cream)

    # Mix the batter
    mixed_batter = mix_batter(ingredients)

    # Bake the cake
    raw_cake = bake(mixed_batter)

    # Decorate the cake
    cake = decorate(raw_cake, ingredients)

    return cake

make_cake("chocolate", True)

There is nothing wrong with this code; every function's inputs and outputs are clear. But to support monitoring, detection, rollback, pausing, network transmission, or framework constraints, each step of the cake-making process gets wrapped in a class, so that these shared features can be implemented behind a uniform interface:

class Node():
    def run(self, input: dict) -> dict:
        raise NotImplementedError

    def load(self, filepath):
        pass

    def save(self, filepath):
        pass

    def rollback(self, filepath):
        # Roll back from an error
        pass


class GetIngredientsNode(Node):
    def run(self, input: dict) -> dict:
        # Implementation
        pass

class Pipeline():
    def __init__(self, node_list):
        self.node_list = node_list

    def run(self, data: dict):
        for node in self.node_list:
            data = node.run(data)
        return data

make_cake_pipeline = Pipeline([
    GetIngredientsNode(),
    MixBatterNode(),
    BakeNode(),
    DecorateNode(),
])

data = {
    'flavor': "chocolate",
    'with_whipping_cream': True,
}
data = make_cake_pipeline.run(data)
data['cake']  # the cake we wanted

"""
After the pipeline runs, data has accumulated everything:

data = {
    'flavor': "chocolate",
    'with_whipping_cream': True,
    'ingredients': ingredients,
    'mixed_batter': mixed_batter,
    'raw_cake': raw_cake,
    'cake': cake
}
"""

So the Pipeline and Node classes are designed, and GetIngredientsNode, MixBatterNode, BakeNode, and DecorateNode are implemented, making rollback possible. But data now carries every piece of information in the whole flow, much like a global variable.

This wastes a team a great deal of time. Suppose DecorateNode's run fails because of one entry in data: you must walk back through the earlier Nodes until you find the code near where that entry was written. A Node may also quietly overwrite a key that has nothing to do with it, forcing you to inspect every Node. Worse, errors are deferred until some Node actually executes, even when the problem is merely mis-passed data. All of this violates the principles of good code.
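A minimal sketch of the silent-overwrite hazard, with hypothetical node bodies: nothing complains when BakeNode clobbers 'ingredients', and the failure only shows up later inside DecorateNode, far from its cause.

class BakeNode(Node):
    def run(self, data: dict) -> dict:
        data['raw_cake'] = bake(data['mixed_batter'])
        data['ingredients'] = None  # oops: clobbers a key this node does not own
        return data

class DecorateNode(Node):
    def run(self, data: dict) -> dict:
        # Fails here, one node away from the line that actually caused it
        data['cake'] = decorate(data['raw_cake'], data['ingredients'])
        return data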

If the data relationships between Nodes are described explicitly, rather than implicitly threading everything through data, a dict acting as a modern incarnation of the global variable that crosses many Nodes, debugging becomes far simpler. Consider the revised code:

class Node():
    def __init__(self, is_source=False, is_target=False):
        self.is_source = is_source
        self.is_target = is_target
        self.input = {}
        self.output = {}
        self._init_input_keys()
        self._init_output_keys()

    def _init_input_keys(self):
        # Declare the expected input keys and their types
        raise NotImplementedError

    def _init_output_keys(self):
        # Declare the expected output keys and their types
        raise NotImplementedError

    def verify_input(self):
        # Prevent callers from injecting unrelated input data
        for key in self.input:
            if key not in self.input_keys:
                raise ValueError
        # Type checks ...

    def verify_output(self):
        # Prevent implementations from emitting unrelated output data
        for key in self.output:
            if key not in self.output_keys:
                raise ValueError
        # Type checks ...

    def set_input(self, data: dict):
        # Stop immediately on bad data
        self.input = data
        self.verify_input()

    def set_output(self, data: dict):
        # Stop immediately on bad data
        self.output = data
        self.verify_output()

    def run_imp(self, read_only_input: dict) -> dict:
        # To be implemented by subclasses
        # By convention, read_only_input must not be modified
        raise NotImplementedError

    def run(self):
        # Do not override, so that data is always verified
        self.verify_input()
        self.output = self.run_imp(self.input)
        self.verify_output()

class GetIngredientsNode(Node):
    def _init_input_keys(self):
        self.input_keys = {
            'flavor': Flavor,
        }

    def _init_output_keys(self):
        self.output_keys = {'ingredients': Ingredients}

    def run_imp(self, read_only_input) -> dict:
        # Gather the ingredients
        ingredients = get_ingredients(read_only_input['flavor'])
        return {'ingredients': ingredients}

Because the new Node declares and verifies the validity of its inputs and outputs before run, errors no longer have to wait until mid-execution to appear.
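For example (reusing the hypothetical Flavor type from above), an unrelated key is rejected the moment it is handed to the node, not when some downstream node trips over it:

node = GetIngredientsNode()
node.set_input({'flavor': Flavor('chocolate'), 'frosting': True})
# raises ValueError immediately: 'frosting' is not among the declared input_keys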

class Pipeline:
    def __init__(self, nodes, edges, source_node, target_node):
        self.nodes = nodes
        self.source_node = source_node
        self.target_node = target_node
        self.edges = edges

        # Reject cyclic graphs, which would never finish
        self.check_no_cycle(edges)

        # Every declared input and output is connected, nothing missing or extra
        self.check_link_fulfilled(edges)

        # Ensure there is exactly one source and one target
        self.check_single_src_dst(edges)

    def run(self, input):
        res = None

        # Execution order follows the data dependencies, starting from the source
        order = self.get_run_order()
        for node in order:
            if node is self.source_node:
                node.set_input(input)
            else:
                # Collect the already-computed upstream outputs
                node.set_input(self.get_input(node))

            node.run()

            if node is self.target_node:
                res = node.output
        return res

    def load(self, filepath):
        pass

    def save(self, filepath):
        pass

    def rollback(self, filepath):
        pass

    def visualize(self, filepath):
        # Draw the data dependency graph
        pass
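get_run_order is left unimplemented above. Here is a minimal sketch of it as a Pipeline method, using Kahn's topological sort and assuming self.edges maps (producer node id, key) to (consumer node id, key), as in the usage below:

from collections import defaultdict, deque

def get_run_order(self):
    # Count, for each node, how many distinct upstream nodes feed it
    by_id = {id(node): node for node in self.nodes}
    consumers = defaultdict(set)
    indegree = {id(node): 0 for node in self.nodes}
    for (src_id, _), (dst_id, _) in self.edges.items():
        if dst_id not in consumers[src_id]:
            consumers[src_id].add(dst_id)
            indegree[dst_id] += 1

    # Repeatedly emit nodes whose upstream dependencies are all satisfied
    ready = deque(nid for nid, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        nid = ready.popleft()
        order.append(by_id[nid])
        for dst_id in consumers[nid]:
            indegree[dst_id] -= 1
            if indegree[dst_id] == 0:
                ready.append(dst_id)
    return order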

get_ingredients_node = GetIngredientsNode()
mix_batter_node = MixBatterNode()
bake_node = BakeNode()
decorate_node = DecorateNode()

nodes = [
    get_ingredients_node,
    mix_batter_node,
    bake_node,
    decorate_node,
]

# Describe the data relationships explicitly
edges = {
    (id(get_ingredients_node), 'flavor'): (id(mix_batter_node), 'flavor'),
    (id(get_ingredients_node), 'with_whipping_cream'): (id(mix_batter_node), 'with_whipping_cream'),
    (id(mix_batter_node), 'mixed_batter'): (id(bake_node), 'mixed_batter'),
    (id(bake_node), 'raw_cake'): (id(decorate_node), 'raw_cake'),
    (id(get_ingredients_node), 'ingredients'): (id(decorate_node), 'ingredients'),
}

make_cake_pipeline = Pipeline(
    nodes,
    edges,
    get_ingredients_node,
    decorate_node,
)

input = {
    'flavor': "chocolate",
    'with_whipping_cream': True,
}

output = make_cake_pipeline.run(input)
output['cake']  # the cake we wanted, and output contains only the cake

Pipeline's visualize exposes the data flow, which keeps debugging focused. Each Node still takes and returns a dict, but because keys are declared and verified, data no longer spreads to Nodes that have no need for it. A former colleague pointed out that using Marshmallow or Pydantic for the validation, together with immutable attributes, would make the code even simpler.
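As a sketch of that suggestion, assuming Pydantic v2: a frozen model with extra keys forbidden replaces the hand-written verify_input/verify_output, rejecting unknown keys and wrong types in one place.

from pydantic import BaseModel, ConfigDict

class GetIngredientsInput(BaseModel):
    # Immutable, and unknown keys are rejected at construction time
    model_config = ConfigDict(frozen=True, extra='forbid')
    flavor: str

GetIngredientsInput(flavor='chocolate')                 # ok
GetIngredientsInput(flavor='chocolate', frosting=True)  # ValidationError: extra key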

Over the years and across different workplaces, I have constantly seen people pass a single dict into every function of a data processing flow. Once the number of Nodes grows and the data relationships get complex, even a tiny bug can take a very long time to fix, sometimes weeks, and newcomers cannot get productive quickly, because they must understand the entire flow before they can diagnose anything. I hope this article saves you some of that time.
