# Files (CSV, JSON, Parquet)

Read and write local files in various formats, including CSV, JSON, Parquet, and Excel.
## Supported formats
| Format | Extension | Read | Write |
|---|---|---|---|
| CSV | .csv | Yes | Yes |
| JSON | .json | Yes | Yes |
| Parquet | .parquet | Yes | Yes |
| Excel | .xlsx, .xls | Yes | No |
## Reading files (source_file)
```json
{
  "type": "source_file",
  "config": {
    "path": "/data/sales/2024-01.csv",
    "format": "csv"
  }
}
```

### CSV options
| Key | Default | Description |
|---|---|---|
| `delimiter` | `,` | Field separator |
| `has_header` | `true` | First row contains column names |
For example, reading a tab-separated file:

```json
{
  "path": "/data/export.tsv",
  "format": "csv",
  "delimiter": "\t",
  "has_header": true
}
```

### JSON files
JSON files should contain an array of objects:

```json
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]
```

Alternatively, a file may contain a single object wrapping a data array; use `json_path` to extract it:
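For example, a wrapped file where the rows live under a `results` key (matching the `json_path` in the config that follows) might look like:

```json
{
  "results": [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"}
  ]
}
```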
```json
{
  "path": "/data/response.json",
  "format": "json",
  "json_path": "results"
}
```

### Parquet files
```json
{
  "path": "/data/events.parquet",
  "format": "parquet"
}
```

### Excel files
```json
{
  "path": "/data/report.xlsx",
  "format": "excel"
}
```

## Writing files (sink_file)
```json
{
  "type": "sink_file",
  "config": {
    "path": "/data/output/users.csv",
    "format": "csv"
  }
}
```

## Dynamic file paths
Use pipeline parameters for date-partitioned output:

```json
{
  "path": "/data/output/events_${param.date}.csv",
  "format": "csv"
}
```

With the `date` parameter set to `2024-01-15`, for example, the path resolves to `/data/output/events_2024-01-15.csv`.

## File security
By default, Brokoli can read and write files anywhere the process has access. Restrict this with the `BROKOLI_DATA_DIRS` environment variable:

```shell
export BROKOLI_DATA_DIRS=/data:/tmp
broked serve
```

When set, file nodes can only access paths under the listed directories; any attempt to read or write outside these directories fails with an error.
**Warning:** Always set `BROKOLI_DATA_DIRS` in production to prevent pipelines from accessing sensitive files on the server.
## Using with connections

Files don't require connections. However, for remote files (SFTP, S3), create a connection and use it in the node config:

- **SFTP:** Create an `sftp` connection, then reference it in source/sink nodes.
- **S3:** Create an `s3` connection with bucket, region, and credentials in the `extra` field.
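As a rough sketch only (the full connection schema isn't shown on this page, and the key names inside `extra` are assumptions), an `s3` connection might look like:

```json
{
  "name": "my-s3",
  "type": "s3",
  "extra": {
    "bucket": "my-bucket",
    "region": "us-east-1",
    "access_key_id": "...",
    "secret_access_key": "..."
  }
}
```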
## Performance

For large files:

- **CSV:** streaming reader, low memory usage.
- **JSON:** loaded fully into memory; consider splitting large JSON files.
- **Parquet:** columnar format, efficient for large datasets.
- Code nodes with large datasets (>10K rows) automatically use file-based transfer, for 5-10x faster processing.
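Since JSON sources are loaded fully into memory, very large JSON array files may need splitting before ingestion. A minimal sketch in plain Python (not part of Brokoli; the function name and chunk size are illustrative):

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, chunk_size=10000):
    """Split one JSON array file into chunk_size-row part files.

    Note: this sketch itself loads the source fully into memory,
    so run it once on a machine with enough RAM, then feed the
    smaller part files to the pipeline.
    """
    rows = json.loads(Path(src).read_text())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts = []
    for i in range(0, len(rows), chunk_size):
        part = out / f"part_{i // chunk_size:04d}.json"
        part.write_text(json.dumps(rows[i:i + chunk_size]))
        parts.append(part)
    return parts
```

Each part file remains a valid JSON array of objects, matching the format the JSON reader expects.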