Making Compression a Habit with zstd
Practical guide to using zstd compression in Python and Linux for data processing, log management, and reducing storage and transfer costs.
· 4 min read
With zstd support landing in Python 3.14's standard library, I’ve been using compressed files more often in my workflow. Here’s what I’ve learned about making compression a habit.
Python Data Processing with Compression #
Reading and writing JSONL (JSON Lines) files with Zstandard compression:
Pre-3.14 #
Install with: uv add zstandard
import zstandard as zstd
import json
import io

# Writing compressed JSONL with Zstandard
data = [
    {"id": 1, "name": "Alice", "score": 95},
    {"id": 2, "name": "Bob", "score": 87},
    {"id": 3, "name": "Charlie", "score": 92}
]

# Write
with open('data.jsonl.zst', 'wb') as f:
    cctx = zstd.ZstdCompressor(level=3)
    with cctx.stream_writer(f) as writer:
        for record in data:
            line = (json.dumps(record) + '\n').encode('utf-8')
            writer.write(line)

# Read
with open('data.jsonl.zst', 'rb') as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        text_stream = io.TextIOWrapper(reader, encoding='utf-8')
        for line in text_stream:
            record = json.loads(line)
            print(record)

Python 3.14+ #
# Simple example - you'll handle more lines and errors in production
from compression import zstd
import json

# Read and print first record
with zstd.open('data.jsonl.zst', "rt") as f:
    for line in f:
        data = json.loads(line)
        print(data)
        break  # Remove break to read all lines

# Write
with zstd.open('data.jsonl.zst', "wt") as f:
    f.write(json.dumps(data) + "\n")

Key points:
- Use 'wt' mode for writing text (or 'wb' for binary)
- Use 'rt' mode for reading text (or 'rb' for binary)
- The zstd API is similar to regular open() - just use zstd.open() instead of open()
- Typical compression ratio: 6-7x size reduction for zstd-3
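To sanity-check that ratio on your own data, you can compress a copy from the command line and compare sizes (the file names here are just examples):

# Compress at level 3 and keep the original file
zstd -3 -k data.jsonl
# Compare original and compressed sizes
ls -lh data.jsonl data.jsonl.zst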
Benchmarking Your Workload #
Benchmark compression on your own workload to understand the trade-offs between speed and compression ratio.
For archival of logs or long-term storage, you can use higher compression levels of zstd. Archives like Pushshift Reddit typically use level 22. For most use cases, zstd-3 is a good default.
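The zstd CLI ships a benchmark mode that makes this easy. Here’s a minimal sketch that sweeps a range of levels over a representative file; the file name and level range are placeholders, so point it at data that looks like yours:

# Benchmark compression levels 1 through 19 on a sample file
zstd -b1 -e19 data.jsonl
# Or test a single level, e.g. the default-ish level 3
zstd -b3 data.jsonl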
Working with Compressed Files #
Zstd includes tools for viewing, searching, and processing compressed files without manual decompression.
Viewing Compressed Files #
# View compressed data file
zstdcat data.json.zst
# Page through compressed file
zstdless data.json.zst
# View last N lines
zstdcat data.jsonl.zst | tail -n 50
Searching Through Compressed Data #
# Search for pattern in compressed file
zstdgrep "error" events.json.zst
# Search across multiple compressed files
zstdgrep "user_id" *.json.zst
# Case-insensitive search with line numbers
zstdgrep -in "warning" logs/*.zst
# Count occurrences
zstdgrep -c "timeout" events.json.zst
Combining with Other Tools #
# Extract specific fields with awk
zstdcat data.jsonl.zst | awk '{print $1, $7}' | sort | uniq -c
# Parse JSON data
zstdcat events.json.zst | grep ERROR | jq '.timestamp, .message'
# Count unique values
zstdcat data.jsonl.zst | awk '{print $1}' | sort -u | wc -l
Transferring Files #
Rsync #
Many of us use rsync daily. The -z flag compresses data during transfer. On highly compressible files, rsync will report a speedup well above 1.0x, meaning the bytes that actually crossed the network are far fewer than the original file size. Here’s a test with about 66 GB of JSONL files:
sent 50,322 bytes received 12,451,737,167 bytes 19,290,143.28 bytes/sec
total size is 66,857,841,487 speedup is 5.37
While this might seem like a small optimization, it’s highly beneficial if you’re network-bound or concerned about egress costs.
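For reference, a command along these lines produces a summary like the one above; the host and paths are placeholders, and --stats simply prints more detail:

# -a archive mode, -v verbose, -z compress data during transfer
rsync -avz --stats /data/jsonl/ user@backup-host:/backups/jsonl/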
S3 and Cloud Storage #
AWS charges for outbound data transfer (egress). Compressing data before storage can significantly reduce these costs: with a 7.0x compression ratio, a $14,000 egress bill drops to roughly $2,000. Another benefit is faster transfers. Storing files compressed is the simplest win, but you can also compress and decompress on the fly during transfer - for example, decompressing on download when you want to open a JSON file in an IDE.
Here’s an example over a gigabit connection; servers with faster links will be quicker still:
Uploading
# data.jsonl is a 4G file of JSON lines
ls -la data.jsonl
4285772892 Jan 5 20:12 data.jsonl
# pv shows progress (may not be completely accurate)
# command compresses, keeps original, and pipes to S3
time zstd -k -c data.jsonl | pv | s5cmd pipe s3://my-bucket/data.zst
363MiB 0:00:05 [71.3MiB/s]
real 0m8.139s
user 0m13.164s
sys 0m2.827s
s5cmd du -H s3://my-bucket/data.zst
363.7M bytes in 1 objects: s3://my-bucket/data.zst
# Copying without compression for comparison
time s5cmd cp data.jsonl s3://my-bucket/data.jsonl
real 0m57.547s
user 0m17.131s
sys 0m9.398s
s5cmd du -H s3://my-bucket/data.jsonl
4.0G bytes in 1 objects: s3://my-bucket/data.jsonl
The same works for downloading:
Downloading
# Decompress during download (compression takes longer than decompression)
time s5cmd cat s3://my-bucket/data.zst | pv | zstd -d > data.jsonl
363MiB 0:00:06 [52.7MiB/s]
real 0m6.903s
user 0m4.023s
sys 0m3.880s
# Direct copy would be similar time unless you have different upload and download speeds
# time s5cmd cp s3://my-bucket/data.jsonl data.jsonl
# Verify the file
diff -s data.jsonl original_data.jsonl
Files data.jsonl and original_data.jsonl are identical
Conclusion #
I use zstdcat to read files and rarely need to edit them in an IDE. This habit has cut my text storage by up to 80%. There’s a balance between convenience, speed, and storage - and this works for me. More optimized formats like Protobuf or Arrow exist, but most text processing still runs on JSON.