Learnings of the Poor

When hardware resources are scarce, understanding memory-efficient patterns becomes essential. This post explores practical optimization techniques using iterators and streaming pipelines in Python, demonstrating how to process large files without loading them entirely into memory.

Wayne Lau

  ·  6 min read

Learnings of the Poor #

Necessity is the mother of invention

I was already GPU poor, but a recent job change combined with rising component prices has also made me RAM and NVMe poor.

While I am nowhere near the optimisation experts of the 90s and early 2000s, I took this time to brush up on some fundamentals and key concepts in Python. As the saying goes:

“Premature optimisation is the root of all evil” (Donald Knuth)

We are not looking for very deep optimisations; these changes aim to follow the Pareto Principle, where 80% of the outcome comes from 20% of the effort. The changes below may or may not be exactly 20% effort, but I would consider them low-effort.

As such, there won’t be any discussion of performance profiling, such as identifying hot loops, cache misses, or memory reallocations.

Iterators #

Frankly, I think this is an important concept that carries over well regardless of language. Understanding iterators also helps when you need to reason about channels, which are central to Go.

import json

# Example stubs for clarity
def is_good(record: dict) -> bool:
    """Return True if record passes filtering criteria"""
    return True

def process_data(record: dict) -> dict:
    """Transform a single record and return modified version"""
    return record

# for all the code below, list comprehension can be used, but i am using loops
# for readability
def read_file(file: str) -> list[dict]:
    data = []
    with open(file, "r") as f:
        for line in f:
            data.append(json.loads(line))
    return data
# assume that maybe first_filter is your processing pipeline
# is_good is a function that returns bool, we want to keep true
def first_filter(input_data: list[dict]) -> list[dict]:
    result = []
    for record in input_data:
        if is_good(record):
            result.append(record)
    return result
# process_data takes in a dict and passes out a dict, usually it's different but just for examples
def second_processing(input_data: list[dict]) -> list[dict]:
    result = []
    for record in input_data:
        result.append(process_data(record))
    return result
    
def write_processed_data(input_data: list[dict], output_file: str):
    with open(output_file, 'w') as f:
        for record in input_data:
            f.write(json.dumps(record) + '\n')
# you would execute it like so

data = read_file("data.jsonl")
data = first_filter(data)
data = second_processing(data)
write_processed_data(data, "output.jsonl")

The issue with the above code is that you are collecting the results at every stage, which contributes to memory growth. If the raw size of data.jsonl is bigger than your RAM, you will run out of memory (OOM) very fast. Even for a small dataset, it may help you to learn the yield keyword, and it forces you to think in terms of pipelines. For the Iterator type hint below: on Python 3.9+, prefer from collections.abc import Iterator over typing.Iterator.

import json
from collections.abc import Iterator

def read_file(file: str) -> Iterator[dict]:
    with open(file, "r") as f:
        for line in f:
            yield json.loads(line)
# assume that maybe first_filter is your processing pipeline
# is_good is a function that returns bool, we want to keep true
def first_filter(input_data: Iterator[dict]) -> Iterator[dict]:
    for record in input_data:
        if is_good(record):
            yield record

# process_data takes in a dict and passes out a dict, usually it's different but just for examples
def second_processing(input_data: Iterator[dict]) -> Iterator[dict]:
    for record in input_data:
        yield process_data(record)

def write_processed_data(input_data: Iterator[dict], output_file: str):
    with open(output_file, 'w') as f:
        for record in input_data:
            f.write(json.dumps(record) + '\n')
            
# note that the data types are Iterator[dict] here
data = read_file("data.jsonl")
data = first_filter(data)
data = second_processing(data)
write_processed_data(data, "output.jsonl")

In the above code, memory usage is significantly lower, because only one record is in flight at any time rather than the full dataset.
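To see the laziness in action, here is a toy sketch (the stage names are illustrative, not from the pipeline above) where each stage prints when it handles a value:

```python
def numbers():
    # toy source stage: yields values one at a time
    for i in range(3):
        print(f"producing {i}")
        yield i

def doubler(values):
    # toy processing stage: transforms each value as it arrives
    for i in values:
        print(f"doubling {i}")
        yield i * 2

# Nothing executes until the pipeline is consumed; each value travels
# through every stage before the next one is produced, so only a single
# record is in flight at any time.
result = list(doubler(numbers()))
print(result)  # [0, 2, 4]
```

Running this interleaves "producing" and "doubling" lines, which is exactly the streaming behaviour the file pipeline relies on.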

Caveats #

  • One big caveat is that the files are held open for the lifetime of the pipeline, so you need stable reads and writes; unintentional edits or moves will break the pipeline.
  • When writing JSONL, json.dumps does not add a trailing newline, so f.write(json.dumps(record) + '\n') is intentional — each JSON object needs to be on its own line.

Learning points #

This might be just my opinion, but I find that iterators are a stepping stone to understanding pipelines, channels, or pub/sub patterns. When you understand iterators, you also understand the bottlenecks of your code. It’s probably more nuanced than that, but you could say they are all fundamentally iterators that consume and yield.

Say process_data is slow and handles 1 line per second, while reading, filtering, and writing are fast at, say, 4 lines per second. Your pipeline is bounded by the lower limit of 1 line per second. What do you do?

I won’t discuss the use of queues or channels here, as that would go too in-depth, but the structure you want is 1 worker for reading and 4 workers for processing:

Read-worker-1 -> Filter-worker-1 -> Process-worker-{1..4} -> Write-worker-1
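Without reaching for explicit queues, the fan-out shape can be sketched with the stdlib. The example below is a minimal illustration with made-up names and a stand-in for the slow stage; note that Executor.map submits all items eagerly, so for truly huge streams you would submit in bounded chunks (e.g. with itertools.islice).

```python
from concurrent.futures import ThreadPoolExecutor

def slow_process(record: int) -> int:
    # stand-in for the slow 1-line-per-second processing stage
    return record * 10

records = iter(range(8))  # stand-in for the read + filter stages

# 4 worker threads drain the slow stage in parallel; map preserves
# input order. Caveat: Executor.map submits everything up front, so
# bound the submission yourself for very large streams.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_process, records))

print(results)  # [0, 10, 20, 30, 40, 50, 60, 70]
```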

Compression #

In my Compression post, I mentioned that benchmarks should be done to know whether your use case benefits from compression. For write-once, read-many scenarios, higher compression levels may help.

Here is an actual measurement:

IO constrained #

Realistically you wouldn’t have a slow network or drive for server workloads, but here I have a situation where I need to read and process a jsonl file that sits on a NAS. The code is very simple: it reads line by line and calls json.loads.

ZST: 100000it [00:05, 17220.01it/s]
ZST: 5.85s (9.47 MB/s)
Raw: 100000it [00:40, 2492.39it/s]
Raw: 40.14s (11.15 MB/s)

After multiple runs the throughput in MB/s should stabilise, but the main thing to note is that because the data is compressed, you can read more data per buffer: more lines fit per MB of compressed jsonl compared to its raw form.
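The same streaming pattern works for compressed files. Here is a sketch using the stdlib gzip module (zstandard needs a third-party package, so gzip stands in; the file name is made up): decompression happens lazily as you iterate, so memory stays flat.

```python
import gzip
import json

def read_compressed_jsonl(path: str):
    # "rt" gives a text-mode stream that decompresses as you iterate,
    # so the whole decompressed file is never held in memory
    with gzip.open(path, "rt") as f:
        for line in f:
            yield json.loads(line)

# Write a small compressed sample, then stream it back.
with gzip.open("sample.jsonl.gz", "wt") as f:
    for i in range(3):
        f.write(json.dumps({"id": i}) + "\n")

records = list(read_compressed_jsonl("sample.jsonl.gz"))
print(records)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

This drops into the earlier pipeline by replacing read_file, with no changes to the downstream stages.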

Less is more #

This applies to all programming languages: less work means more efficient processing. That doesn’t mean adding a cache everywhere helps; it’s about eliminating wasted work.

Using the pipeline example above, you can maximise the throughput of each function with highly optimised code, but if you filter after you process, perf or a flamegraph won’t catch the waste, because each function still looks efficient in isolation.

Picture this:

process takes 5s per line
filter takes 1s per line

Let’s assume it runs sequentially like the code above, and that reading and writing take zero time.

At 10000 lines, process -> filter = 10000 * 6s = 60000s
Assuming 50% of the data is bad, filter -> process takes:

10000 * 1s + 5000 * 5s = 35000s
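The arithmetic above is easy to verify in a few lines:

```python
lines = 10_000
process_cost = 5  # seconds per line
filter_cost = 1   # seconds per line
keep_ratio = 0.5  # the filter drops half the data

# process every line first, then filter
process_then_filter = lines * (process_cost + filter_cost)

# filter first, then only process the survivors
filter_then_process = lines * filter_cost + int(lines * keep_ratio) * process_cost

print(process_then_filter)  # 60000
print(filter_then_process)  # 35000
```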
  • No complex code, no need for compiled languages. Usually rewriting into another language is the last thing you would do.
  • Algorithmic complexity matters as well. Choosing the right data structure — a set for membership checks instead of a list, a deque instead of a list for queue operations — can eliminate entire classes of wasted work regardless of language.
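As a quick sketch of that last point, membership checks show it well: a set lookup is O(1) on average versus O(n) for a list, and the gap compounds fast inside a filter stage. The sizes below are arbitrary, just enough to make the difference visible.

```python
import time

n = 100_000
blocked_list = list(range(n))
blocked_set = set(blocked_list)

needle = n - 1  # worst case for the list: it must scan everything

start = time.perf_counter()
for _ in range(1_000):
    needle in blocked_list   # O(n) linear scan each time
list_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1_000):
    needle in blocked_set    # O(1) hash lookup each time
set_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
# the set version is typically orders of magnitude faster
```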