Core Concepts
The mental model behind Brokoli -- pipelines, nodes, edges, runs, connections, and variables.
Pipeline
A pipeline is a directed acyclic graph (DAG) of nodes connected by edges. It describes a data flow from sources through transformations to sinks.
Each pipeline has:
- Name and description
- Nodes -- the processing steps
- Edges -- the connections between nodes
- Schedule -- optional cron expression
- Parameters -- default key-value pairs passed to nodes at runtime
- Tags -- labels for filtering and grouping
- Hooks -- lifecycle webhooks (on_start, on_success, on_failure)
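The fields above can be sketched as a pipeline definition. The following is illustrative only: the field names follow the list above, but the exact schema (and the example node configs) is an assumption, not Brokoli's actual API shape.

```python
# Illustrative pipeline definition mirroring the fields listed above.
# Node configs and hook URLs are hypothetical placeholders.
pipeline = {
    "name": "daily-sales-load",
    "description": "Load sales CSV, clean it, write to Postgres",
    "nodes": [
        {"id": "read", "type": "source_file", "config": {"path": "sales.csv"}},
        {"id": "clean", "type": "transform", "config": {"filter": "amount > 0"}},
        {"id": "write", "type": "sink_db", "config": {"conn_id": "my-postgres"}},
    ],
    "edges": [
        {"from": "read", "to": "clean"},
        {"from": "clean", "to": "write"},
    ],
    "schedule": "0 6 * * *",          # optional cron expression
    "parameters": {"region": "eu"},   # defaults passed to nodes at runtime
    "tags": ["sales", "daily"],
    "hooks": {"on_failure": "https://example.com/alert"},  # lifecycle webhook
}
```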
Node
A node is a single step in a pipeline. Each node has a type that determines what it does, and a config object with type-specific settings.
Node types at a glance
| Type | Category | Description |
|---|---|---|
| `source_file` | Source | Read CSV, JSON, Parquet, or Excel files |
| `source_api` | Source | HTTP GET/POST to a REST API |
| `source_db` | Source | SQL query against a database |
| `transform` | Transform | Rename, filter, add columns, aggregate, sort |
| `code` | Transform | Run custom Python scripts |
| `join` | Transform | Join two datasets on a key |
| `sql_generate` | Transform | Generate SQL INSERT/UPSERT from data |
| `quality_check` | Logic | Assert data quality rules |
| `condition` | Logic | Branch execution based on expressions |
| `sink_file` | Sink | Write to CSV, JSON, Parquet files |
| `sink_db` | Sink | Insert/upsert rows into a database |
| `sink_api` | Sink | POST data to a REST API |
| `migrate` | Operation | Run database migrations |
See Node Types for full details on each type.
Edge
An edge connects one node's output to another node's input. Edges define the data flow and execution order. A node only executes after all its upstream nodes complete.
```json
{"from": "source1", "to": "transform1"}
```

A node can have multiple incoming edges (e.g., a join node receives two datasets) and multiple outgoing edges (fan-out).
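Fan-in and fan-out fall out of the same from/to structure. A sketch, with hypothetical node IDs:

```python
# Illustrative edge list: fan-in to a join node, then fan-out to two sinks.
edges = [
    {"from": "orders", "to": "join1"},     # first input dataset
    {"from": "customers", "to": "join1"},  # second input dataset
    {"from": "join1", "to": "sink_file"},  # fan-out: the join's output feeds
    {"from": "join1", "to": "sink_api"},   #   two downstream sinks
]

# The join node only runs once both of its upstream nodes complete.
incoming = [e["from"] for e in edges if e["to"] == "join1"]
outgoing = [e["to"] for e in edges if e["from"] == "join1"]
```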
Run
A run is a single execution of a pipeline. Each run tracks:
| Field | Description |
|---|---|
| `id` | Unique run identifier |
| `status` | `pending` → `running` → `success` / `failed` / `cancelled` |
| `started_at` | When execution began |
| `finished_at` | When execution completed |
| `node_runs` | Per-node status, duration, and row counts |
| `error` | Error message from the first failed node |
Status lifecycle
```mermaid
stateDiagram-v2
    [*] --> pending
    pending --> running
    running --> success
    running --> failed
    running --> cancelled
    failed --> running : resume
```

Resume: Failed runs can be resumed from the first failed node. Nodes that already succeeded are skipped.
Connections
A connection stores credentials for an external system (database, API, SFTP, S3). Passwords and secrets are encrypted at rest using AES-256-GCM.
Supported connection types:
| Type | Used for |
|---|---|
postgres | PostgreSQL databases |
mysql | MySQL databases |
sqlite | SQLite databases |
http | REST APIs |
sftp | SFTP/SSH file transfers |
s3 | Amazon S3 buckets |
generic | Any TCP-reachable service |
Reference connections in node config with `"conn_id": "my-postgres"` -- the engine resolves the connection ID to a URI at runtime.
Variables
Variables are key-value pairs available to all pipelines. Two types:
- string -- plaintext, visible in the UI
- secret -- encrypted at rest, masked in responses
Reference in node config: `${var.my_variable}`
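Conceptually, resolving a `${var.name}` reference is a string substitution against the variable store. A minimal illustrative sketch (not the engine's actual substitution code; its exact rules may differ):

```python
import re

def resolve_vars(config_value: str, variables: dict) -> str:
    """Replace ${var.name} placeholders with values from the variable store."""
    return re.sub(
        r"\$\{var\.([A-Za-z0-9_]+)\}",
        lambda m: variables[m.group(1)],
        config_value,
    )

# Hypothetical variable named "api_base":
resolve_vars("${var.api_base}/v1/orders", {"api_base": "https://api.example.com"})
# -> "https://api.example.com/v1/orders"
```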
Architecture
Brokoli is a single Go binary with three internal components:
```
broked serve
├── REST API + WebSocket (chi router, JWT auth)
├── Scheduler (cron, timezone-aware)
└── Engine (DAG runner, Kahn's algorithm)
    └── SQLite or PostgreSQL
```

Execution model
The engine uses Kahn's algorithm for topological sorting and executes nodes in waves:
1. Find all nodes with no incoming edges (in-degree = 0)
2. Execute them in parallel (up to 4 concurrent by default)
3. When a wave completes, decrement the in-degree of downstream nodes
4. Repeat until all nodes are processed
This means independent branches of your DAG run in parallel automatically.
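The wave-based scheduling described above can be sketched as follows. This is a simplified sequential model of Kahn's algorithm for grouping nodes into waves; the real engine additionally runs each wave's nodes in parallel:

```python
from collections import defaultdict

def execution_waves(nodes, edges):
    """Group nodes into waves; each wave's nodes have no unfinished upstreams."""
    in_degree = {n: 0 for n in nodes}
    downstream = defaultdict(list)
    for e in edges:
        in_degree[e["to"]] += 1
        downstream[e["from"]].append(e["to"])

    waves = []
    ready = [n for n in nodes if in_degree[n] == 0]  # in-degree = 0
    while ready:
        waves.append(ready)  # these nodes could run in parallel
        next_ready = []
        for n in ready:
            for d in downstream[n]:
                in_degree[d] -= 1      # upstream finished
                if in_degree[d] == 0:  # all upstreams finished
                    next_ready.append(d)
        ready = next_ready
    return waves

# Two independent sources land in the same wave; the join waits for both.
waves = execution_waves(
    ["a", "b", "join", "sink"],
    [{"from": "a", "to": "join"}, {"from": "b", "to": "join"},
     {"from": "join", "to": "sink"}],
)
# waves == [["a", "b"], ["join"], ["sink"]]
```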
Concurrency
- Max concurrent runs: 4 (configurable via `BROKOLI_MAX_CONCURRENT_RUNS`)
- Max parallel nodes per run: 4
- Node timeout: 30 minutes (configurable per node via the `timeout` config key)
- Retry: exponential backoff with `max_retries` and `retry_delay` per node
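Exponential backoff means the wait between attempts doubles each time. A sketch of the delay schedule, assuming `retry_delay` is the base delay in seconds and doubling per attempt (the engine's exact formula may differ):

```python
def backoff_delays(max_retries: int, retry_delay: float) -> list:
    """Delay before each retry attempt: retry_delay doubles every attempt."""
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries)]

# A node configured with max_retries=4 and retry_delay=1.5 would wait:
backoff_delays(4, 1.5)
# -> [1.5, 3.0, 6.0, 12.0]
```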
Workspaces (Enterprise)
Workspaces provide tenant isolation within a single Brokoli instance. Each workspace has its own pipelines, connections, and variables.
Next steps
- Pipeline Overview -- deep dive into pipeline execution
- Scheduling -- cron, timezones, and dependencies
- Python SDK -- define pipelines as code