# Files (CSV, JSON, Parquet)

Read and write local files in various formats, including CSV, JSON, Parquet, and Excel.
## Supported formats
| Format | Extension | Read | Write |
|---|---|---|---|
| CSV | .csv | Yes | Yes |
| JSON | .json | Yes | Yes |
| Parquet | .parquet | Yes | Yes |
| Excel | .xlsx, .xls | Yes | No |
## Reading files (source_file)
```json
{
  "type": "source_file",
  "config": {
    "path": "/data/sales/2024-01.csv",
    "format": "csv"
  }
}
```

### CSV options
| Key | Default | Description |
|---|---|---|
| `delimiter` | `,` | Field separator |
| `has_header` | `true` | First row contains column names |
For example, reading a tab-separated file:

```json
{
  "path": "/data/export.tsv",
  "format": "csv",
  "delimiter": "\t",
  "has_header": true
}
```

### JSON files
JSON files should contain an array of objects:

```json
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]
```

Alternatively, a file may contain a single object wrapping a data array; use `json_path` to extract it:
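For example, a wrapped file where the rows live under a `results` key (matching the `json_path` in the config that follows) might look like:

```json
{
  "results": [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"}
  ]
}
```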
```json
{
  "path": "/data/response.json",
  "format": "json",
  "json_path": "results"
}
```

### Parquet files
```json
{
  "path": "/data/events.parquet",
  "format": "parquet"
}
```

### Excel files
```json
{
  "path": "/data/report.xlsx",
  "format": "excel"
}
```

## Writing files (sink_file)
```json
{
  "type": "sink_file",
  "config": {
    "path": "/data/output/users.csv",
    "format": "csv"
  }
}
```

## Dynamic file paths
Use pipeline parameters for date-partitioned output:

```json
{
  "path": "/data/output/events_${param.date}.csv",
  "format": "csv"
}
```

With the `date` parameter set to `2024-01-15`, for example, the path resolves to `/data/output/events_2024-01-15.csv`.

## File security
By default, Brokoli can read and write files anywhere the process has access. Restrict this with the `BROKOLI_DATA_DIRS` environment variable:

```shell
export BROKOLI_DATA_DIRS=/data:/tmp
broked serve
```

When set, file nodes can only access paths under the listed directories; any attempt to read or write outside these directories fails with an error.
**Warning:** Always set `BROKOLI_DATA_DIRS` in production to prevent pipelines from accessing sensitive files on the server.
## Using with connections

Files don't require connections. However, for remote files (SFTP, S3), create a connection and use it in the node config:

- **SFTP:** Create an `sftp` connection, then reference it in source/sink nodes.
- **S3:** Create an `s3` connection with bucket, region, and credentials in the `extra` field.
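As a rough sketch only (the full connection schema isn't shown on this page, and the key names inside `extra` are assumptions), an `s3` connection might look like:

```json
{
  "name": "my-s3",
  "type": "s3",
  "extra": {
    "bucket": "my-bucket",
    "region": "us-east-1",
    "access_key_id": "...",
    "secret_access_key": "..."
  }
}
```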
## Performance

For large files:

- **CSV:** streaming reader, low memory usage.
- **JSON:** loaded fully into memory; consider splitting large JSON files.
- **Parquet:** columnar format, efficient for large datasets.
- Code nodes with large datasets (>10K rows) automatically use file-based transfer, for 5-10x faster processing.
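Since JSON sources are loaded fully into memory, very large JSON array files may need splitting before ingestion. A minimal sketch in plain Python (not part of Brokoli; the function name and chunk size are illustrative):

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, chunk_size=10000):
    """Split one JSON array file into chunk_size-row part files.

    Note: this sketch itself loads the source fully into memory,
    so run it once on a machine with enough RAM, then feed the
    smaller part files to the pipeline.
    """
    rows = json.loads(Path(src).read_text())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts = []
    for i in range(0, len(rows), chunk_size):
        part = out / f"part_{i // chunk_size:04d}.json"
        part.write_text(json.dumps(rows[i:i + chunk_size]))
        parts.append(part)
    return parts
```

Each part file remains a valid JSON array of objects, matching the format the JSON reader expects.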