
Files (CSV, JSON, Parquet)

Read and write local files in CSV, JSON, and Parquet formats; Excel files can be read but not written.

Supported formats

Format   | Extension   | Read | Write
CSV      | .csv        | Yes  | Yes
JSON     | .json       | Yes  | Yes
Parquet  | .parquet    | Yes  | Yes
Excel    | .xlsx, .xls | Yes  | No

Reading files (source_file)

{
  "type": "source_file",
  "config": {
    "path": "/data/sales/2024-01.csv",
    "format": "csv"
  }
}

CSV options

Key        | Default | Description
delimiter  | ,       | Field separator
has_header | true    | First row is column names

{
  "path": "/data/export.tsv",
  "format": "csv",
  "delimiter": "\t",
  "has_header": true
}
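For reference, these two options map onto standard CSV parsing behavior. A minimal Python sketch (not Brokoli's implementation) of how a tab-delimited file with a header row would be read:

```python
import csv
import io

# Hypothetical sample standing in for /data/export.tsv
tsv_data = "id\tname\n1\tAlice\n2\tBob\n"

# delimiter="\t" plus has_header=true corresponds to csv.DictReader
# with a custom delimiter: the first row supplies the column names.
reader = csv.DictReader(io.StringIO(tsv_data), delimiter="\t")
rows = list(reader)
print(rows)  # [{'id': '1', 'name': 'Alice'}, {'id': '2', 'name': 'Bob'}]
```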

JSON files

JSON files should contain an array of objects:

[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]

Alternatively, the file can be a single object that wraps the records in an array; use json_path to name the key holding that array:

{
  "path": "/data/response.json",
  "format": "json",
  "json_path": "results"
}
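A minimal sketch of the json_path idea, assuming json_path names a top-level key holding the record array (the lookup syntax Brokoli actually supports may be richer):

```python
import json

# Hypothetical response body with records nested under "results"
raw = '{"count": 2, "results": [{"id": 1}, {"id": 2}]}'

doc = json.loads(raw)
json_path = "results"      # as in the config above
records = doc[json_path]   # extract the array of row objects
print(records)  # [{'id': 1}, {'id': 2}]
```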

Parquet files

{
  "path": "/data/events.parquet",
  "format": "parquet"
}

Excel files

{
  "path": "/data/report.xlsx",
  "format": "excel"
}

Writing files (sink_file)

{
  "type": "sink_file",
  "config": {
    "path": "/data/output/users.csv",
    "format": "csv"
  }
}
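Conceptually, a CSV sink writes each incoming record as one row, with the header taken from the record keys. A rough Python sketch of that behavior (not the actual sink implementation):

```python
import csv
import io

rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

buf = io.StringIO()  # stands in for /data/output/users.csv
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()   # header row from the record keys
writer.writerows(rows)
print(buf.getvalue())
```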

Dynamic file paths

Use pipeline parameters for date-partitioned output:

{
  "path": "/data/output/events_${param.date}.csv",
  "format": "csv"
}
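The substitution follows the familiar ${...} template convention. A sketch of how a path like the one above could be expanded, assuming parameters arrive as a plain dict (the exact resolution rules are Brokoli's own):

```python
import re

params = {"date": "2024-01-15"}  # hypothetical pipeline parameters
path = "/data/output/events_${param.date}.csv"

# Replace each ${param.NAME} token with its value from params.
resolved = re.sub(r"\$\{param\.(\w+)\}", lambda m: params[m.group(1)], path)
print(resolved)  # /data/output/events_2024-01-15.csv
```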

File security

By default, Brokoli can read and write files anywhere the process has access. Restrict this with the BROKOLI_DATA_DIRS environment variable:

export BROKOLI_DATA_DIRS=/data:/tmp
broked serve

When set, file nodes can only access paths under the listed directories. Any attempt to read or write outside these directories fails with an error.
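The containment check amounts to resolving a path and verifying it sits under one of the allowed roots. A hedged Python sketch of such a check (Brokoli's actual enforcement may differ, for example in how it treats symlinks):

```python
import os

allowed = ["/data", "/tmp"]  # as parsed from BROKOLI_DATA_DIRS

def is_allowed(path: str) -> bool:
    # Resolve ".." and symlinks before comparing, so a path like
    # /data/../etc/passwd cannot escape the allowed directories.
    real = os.path.realpath(path)
    return any(
        os.path.commonpath([real, root]) == root for root in allowed
    )

print(is_allowed("/data/sales/2024-01.csv"))  # True
print(is_allowed("/data/../etc/passwd"))      # False
```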

Warning: Always set BROKOLI_DATA_DIRS in production to prevent pipelines from accessing sensitive files on the server.

Using with connections

Files don't require connections. However, for remote files (SFTP, S3), create a connection and use it in the node config:

  • SFTP: Create an sftp connection, then reference it in source/sink nodes
  • S3: Create an s3 connection with bucket, region, and credentials in the extra field

Performance

For large files:

  • CSV: Streaming reader, low memory usage
  • JSON: Loaded fully into memory; consider splitting large JSON files
  • Parquet: Columnar format, efficient for large datasets
  • Code nodes with large datasets (>10K rows) automatically use file-based transfer for 5-10x faster processing
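The CSV-versus-JSON difference comes down to how each format can be consumed: CSV can be processed row by row, while a JSON document must be fully parsed before any row exists. A small stdlib-only illustration, independent of Brokoli:

```python
import csv
import io
import json

csv_data = "id,name\n1,Alice\n2,Bob\n"
json_data = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# CSV: the reader is an iterator -- rows are produced one at a time,
# so memory use stays flat regardless of file size.
first_row = next(csv.DictReader(io.StringIO(csv_data)))

# JSON: json.loads parses the whole document into memory at once
# before any row is available.
all_rows = json.loads(json_data)

print(first_row)
print(len(all_rows))
```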