Core Concepts
The mental model behind Brokoli -- pipelines, nodes, edges, runs, connections, and variables.
Pipeline
A pipeline is a directed acyclic graph (DAG) of nodes connected by edges. It describes a data flow from sources through transformations to sinks.
Each pipeline has:
- Name and description
- Nodes -- the processing steps
- Edges -- the connections between nodes
- Schedule -- optional cron expression
- Parameters -- default key-value pairs passed to nodes at runtime
- Tags -- labels for filtering and grouping
- Hooks -- lifecycle webhooks (on_start, on_success, on_failure)
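The fields above can be sketched as a pipeline definition. The following is illustrative only: the field names follow the list above, but the exact schema (and the example node configs) is an assumption, not Brokoli's actual API shape.

```python
# Illustrative pipeline definition mirroring the fields listed above.
# Node configs and hook URLs are hypothetical placeholders.
pipeline = {
    "name": "daily-sales-load",
    "description": "Load sales CSV, clean it, write to Postgres",
    "nodes": [
        {"id": "read", "type": "source_file", "config": {"path": "sales.csv"}},
        {"id": "clean", "type": "transform", "config": {"filter": "amount > 0"}},
        {"id": "write", "type": "sink_db", "config": {"conn_id": "my-postgres"}},
    ],
    "edges": [
        {"from": "read", "to": "clean"},
        {"from": "clean", "to": "write"},
    ],
    "schedule": "0 6 * * *",          # optional cron expression
    "parameters": {"region": "eu"},   # defaults passed to nodes at runtime
    "tags": ["sales", "daily"],
    "hooks": {"on_failure": "https://example.com/alert"},  # lifecycle webhook
}
```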
Node
A node is a single step in a pipeline. Each node has a type that determines what it does, and a config object with type-specific settings.
Node types at a glance
| Type | Category | Description |
|---|---|---|
| `source_file` | Source | Read CSV, JSON, Parquet, or Excel files |
| `source_api` | Source | HTTP GET/POST to a REST API |
| `source_db` | Source | SQL query against a database |
| `transform` | Transform | Rename, filter, add columns, aggregate, sort |
| `code` | Transform | Run custom Python scripts |
| `join` | Transform | Join two datasets on a key |
| `sql_generate` | Transform | Generate SQL INSERT/UPSERT from data |
| `quality_check` | Logic | Assert data quality rules |
| `condition` | Logic | Branch execution based on expressions |
| `sink_file` | Sink | Write to CSV, JSON, Parquet files |
| `sink_db` | Sink | Insert/upsert rows into a database |
| `sink_api` | Sink | POST data to a REST API |
| `migrate` | Operation | Run database migrations |
See Node Types for full details on each type.
Edge
An edge connects one node's output to another node's input. Edges define the data flow and execution order. A node only executes after all its upstream nodes complete.
```json
{"from": "source1", "to": "transform1"}
```

A node can have multiple incoming edges (e.g., a join node receives two datasets) and multiple outgoing edges (fan-out).
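Fan-in and fan-out fall out of the same from/to structure. A sketch, with hypothetical node IDs:

```python
# Illustrative edge list: fan-in to a join node, then fan-out to two sinks.
edges = [
    {"from": "orders", "to": "join1"},     # first input dataset
    {"from": "customers", "to": "join1"},  # second input dataset
    {"from": "join1", "to": "sink_file"},  # fan-out: the join's output feeds
    {"from": "join1", "to": "sink_api"},   #   two downstream sinks
]

# The join node only runs once both of its upstream nodes complete.
incoming = [e["from"] for e in edges if e["to"] == "join1"]
outgoing = [e["to"] for e in edges if e["from"] == "join1"]
```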
Run
A run is a single execution of a pipeline. Each run tracks:
| Field | Description |
|---|---|
| `id` | Unique run identifier |
| `status` | `pending` → `running` → `success` / `failed` / `cancelled` |
| `started_at` | When execution began |
| `finished_at` | When execution completed |
| `node_runs` | Per-node status, duration, and row counts |
| `error` | Error message from the first failed node |
Status lifecycle
```mermaid
stateDiagram-v2
    [*] --> pending
    pending --> running
    running --> success
    running --> failed
    running --> cancelled
    failed --> running : resume
```

Resume: Failed runs can be resumed from the first failed node. Nodes that already succeeded are skipped.
Connections
A connection stores credentials for an external system (database, API, SFTP, S3). Passwords and secrets are encrypted at rest using AES-256-GCM.
Supported connection types:
| Type | Used for |
|---|---|
postgres | PostgreSQL databases |
mysql | MySQL databases |
sqlite | SQLite databases |
http | REST APIs |
sftp | SFTP/SSH file transfers |
s3 | Amazon S3 buckets |
generic | Any TCP-reachable service |
Reference connections in node config with `"conn_id": "my-postgres"` -- the engine resolves the connection ID to a URI at runtime.
Variables
Variables are key-value pairs available to all pipelines. Two types:
- string -- plaintext, visible in the UI
- secret -- encrypted at rest, masked in responses
Reference in node config: `${var.my_variable}`
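Conceptually, resolving a `${var.name}` reference is a string substitution against the variable store. A minimal illustrative sketch (not the engine's actual substitution code; its exact rules may differ):

```python
import re

def resolve_vars(config_value: str, variables: dict) -> str:
    """Replace ${var.name} placeholders with values from the variable store."""
    return re.sub(
        r"\$\{var\.([A-Za-z0-9_]+)\}",
        lambda m: variables[m.group(1)],
        config_value,
    )

# Hypothetical variable named "api_base":
resolve_vars("${var.api_base}/v1/orders", {"api_base": "https://api.example.com"})
# -> "https://api.example.com/v1/orders"
```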
Architecture
Brokoli is a single Go binary with three internal components:
```
broked serve
├── REST API + WebSocket (chi router, JWT auth)
├── Scheduler (cron, timezone-aware)
└── Engine (DAG runner, Kahn's algorithm)
    └── SQLite or PostgreSQL
```

Execution model
The engine uses Kahn's algorithm for topological sorting and executes nodes in waves:
1. Find all nodes with no incoming edges (in-degree = 0)
2. Execute them in parallel (up to 4 concurrent by default)
3. When a wave completes, decrement the in-degree of downstream nodes
4. Repeat until all nodes are processed
This means independent branches of your DAG run in parallel automatically.
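The wave-based scheduling described above can be sketched as follows. This is a simplified sequential model of Kahn's algorithm for grouping nodes into waves; the real engine additionally runs each wave's nodes in parallel:

```python
from collections import defaultdict

def execution_waves(nodes, edges):
    """Group nodes into waves; each wave's nodes have no unfinished upstreams."""
    in_degree = {n: 0 for n in nodes}
    downstream = defaultdict(list)
    for e in edges:
        in_degree[e["to"]] += 1
        downstream[e["from"]].append(e["to"])

    waves = []
    ready = [n for n in nodes if in_degree[n] == 0]  # in-degree = 0
    while ready:
        waves.append(ready)  # these nodes could run in parallel
        next_ready = []
        for n in ready:
            for d in downstream[n]:
                in_degree[d] -= 1      # upstream finished
                if in_degree[d] == 0:  # all upstreams finished
                    next_ready.append(d)
        ready = next_ready
    return waves

# Two independent sources land in the same wave; the join waits for both.
waves = execution_waves(
    ["a", "b", "join", "sink"],
    [{"from": "a", "to": "join"}, {"from": "b", "to": "join"},
     {"from": "join", "to": "sink"}],
)
# waves == [["a", "b"], ["join"], ["sink"]]
```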
Concurrency
- Max concurrent runs: 4 (configurable via `BROKOLI_MAX_CONCURRENT_RUNS`)
- Max parallel nodes per run: 4
- Node timeout: 30 minutes (configurable per node via the `timeout` config key)
- Retry: exponential backoff with `max_retries` and `retry_delay` per node
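Exponential backoff means the wait between attempts doubles each time. A sketch of the delay schedule, assuming `retry_delay` is the base delay in seconds and doubling per attempt (the engine's exact formula may differ):

```python
def backoff_delays(max_retries: int, retry_delay: float) -> list:
    """Delay before each retry attempt: retry_delay doubles every attempt."""
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries)]

# A node configured with max_retries=4 and retry_delay=1.5 would wait:
backoff_delays(4, 1.5)
# -> [1.5, 3.0, 6.0, 12.0]
```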
Workspaces (Enterprise)
Workspaces provide tenant isolation within a single Brokoli instance. Each workspace has its own pipelines, connections, and variables.
Next steps
- Pipeline Overview -- deep dive into pipeline execution
- Scheduling -- cron, timezones, and dependencies
- Python SDK -- define pipelines as code