How to Use Pickle Files in Python for Saving Models, States, and Long-Running Jobs

Introduction

In Python, a pickle file (.pkl or .pickle) stores serialized Python objects on disk — that is, Python objects (lists, dictionaries, models, DataFrames, etc.) converted into a byte stream that can later be deserialized (loaded back into memory exactly as they were).

What “Pickling” Means

  • Pickling → converting a Python object into a byte stream (so it can be saved to a file or sent over a network).
  • Unpickling → converting the byte stream back into the original Python object.
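For example, the round trip can be done entirely in memory with pickle.dumps and pickle.loads, which produce and consume the raw byte stream directly:

```python
import pickle

obj = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Pickling: object -> byte stream
payload = pickle.dumps(obj)
print(type(payload))  # <class 'bytes'>

# Unpickling: byte stream -> an equal object
restored = pickle.loads(payload)
print(restored == obj)  # True
```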

Why It’s Useful

You can use pickle to:

  • Save trained machine learning models (e.g., scikit-learn, XGBoost).
  • Store preprocessed data or intermediate results between runs.
  • Cache complex Python objects (e.g., dictionaries, lists, pandas DataFrames).

Example Usage

Saving (Pickling) an Object

import pickle

data = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Save to file
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

'wb' means “write binary”.
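pickle.dump also takes an optional protocol argument. Newer protocols are more compact and faster, but files written with them can't be read by older Python versions. A minimal illustration that simply requests the newest protocol available:

```python
import pickle

data = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Explicitly use the newest protocol this interpreter supports
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.HIGHEST_PROTOCOL)  # 5 on Python 3.8 and later
```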

Loading (Unpickling) the Object

import pickle

# Load from file
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)
# {'name': 'Benjamin', 'scores': [85, 92, 78]}

'rb' means “read binary”.

Important Notes

  1. Security warning:
    Never unpickle data from untrusted sources, because pickle.load() can execute arbitrary Python code (it’s not sandboxed).

  2. Compatibility:
    Pickled files are Python-version and library-version dependent. A file pickled under Python 3.11 may not load cleanly under Python 3.9, and a model pickled with one version of a library (e.g., scikit-learn) may fail to unpickle with another.

  3. Alternatives:
    • For safe and interoperable storage → use json, csv, or parquet.
    • For machine learning models → prefer the frameworks’ own methods (joblib, torch.save, tf.keras.models.save_model).
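For instance, the same kind of simple dictionary could be stored as JSON instead, which any language can read back (at the cost of only supporting basic types such as dicts, lists, strings, and numbers):

```python
import json

data = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Save as human-readable, language-neutral JSON
with open('data.json', 'w') as f:
    json.dump(data, f)

# Load it back
with open('data.json') as f:
    loaded = json.load(f)

print(loaded == data)  # True
```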

Basic Examples of Pickle File Usage

Example 1: Pickling a pandas DataFrame

Let’s say you have a DataFrame you want to save and reload later:

import pandas as pd
import pickle

# Create a DataFrame
df = pd.DataFrame({
    'city': ['New York', 'Paris', 'Tokyo'],
    'temperature': [22, 19, 25]
})

# --- Save (pickle) the DataFrame ---
with open('weather.pkl', 'wb') as f:
    pickle.dump(df, f)

# --- Load (unpickle) it back ---
with open('weather.pkl', 'rb') as f:
    df_loaded = pickle.load(f)

print(df_loaded)

Output:

       city  temperature
0  New York           22
1     Paris           19
2     Tokyo           25

Note: pandas also has its own convenience method:

df.to_pickle('weather.pkl')
df_loaded = pd.read_pickle('weather.pkl')

That’s just a wrapper around the same pickle mechanism.

Example 2: Pickling a Trained Machine Learning Model

Suppose you trained a scikit-learn model and want to reuse it without retraining:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pickle

# Load example data
X, y = load_iris(return_X_y=True)

# Train model
model = RandomForestClassifier()
model.fit(X, y)

# --- Save (pickle) model ---
with open('rf_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# --- Load it back ---
with open('rf_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Verify it's working
print(loaded_model.predict([[5.1, 3.5, 1.4, 0.2]]))

Output example:

[0]

So your trained model is restored exactly as before — ready to predict again.

Tip: Use joblib for large models

joblib is similar to pickle but more efficient for big NumPy arrays:

import joblib

joblib.dump(model, 'rf_model.joblib')
model_loaded = joblib.load('rf_model.joblib')

Pickle multiple objects together

Here’s the cleanest and safest way to pickle multiple objects together in a single file:
Put everything into one dictionary, and pickle the dictionary.

This ensures you keep:

  • your model
  • your scaler
  • feature names
  • any metadata (means, thresholds, timestamps, etc.)

all together in one .pkl file.

Example: Pickling Multiple Objects in One File

Suppose you have:

  • a trained model
  • a StandardScaler
  • a list of feature names
  • some metadata

Here’s the full code:

import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Fake training for example
X = np.random.rand(100, 3)
y = np.random.rand(100)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = RandomForestRegressor()
model.fit(X_scaled, y)

feature_names = ['feat1', 'feat2', 'feat3']
metadata = {'version': '1.0', 'created_by': 'Benjamin'}

# Pack everything in a dictionary
bundle = {
    'model': model,
    'scaler': scaler,
    'features': feature_names,
    'metadata': metadata
}

# --- Save the whole bundle ---
with open('model_bundle.pkl', 'wb') as f:
    pickle.dump(bundle, f)

Loading Everything Back

import pickle

# --- Load the bundle ---
with open('model_bundle.pkl', 'rb') as f:
    bundle = pickle.load(f)

model = bundle['model']
scaler = bundle['scaler']
features = bundle['features']
metadata = bundle['metadata']

print(metadata)
print(features)

Using the Loaded Model

import numpy as np

# Example new data point
new_data = np.array([[0.2, 0.5, 0.9]])

# Apply scaler (same scaler used before)
new_data_scaled = scaler.transform(new_data)

# Predict
pred = model.predict(new_data_scaled)
print(pred)

Best Practice

Bundling objects in a dictionary keeps everything in one versioned, easy-to-inspect file.

Can also bundle:

  • encoders (LabelEncoder, OneHotEncoder)
  • imputers
  • PCA or UMAP transformers
  • thresholds for classification
  • hyperparameters
  • training metrics

Avoid unpickling files from untrusted sources

Security issue: pickle can run arbitrary code.
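To see why, note that unpickling can invoke arbitrary callables via an object's __reduce__ hook. This deliberately harmless sketch makes pickle.loads call print, but a malicious file could return os.system and a shell command instead:

```python
import pickle

class Demo:
    def __reduce__(self):
        # Whatever (callable, args) pair we return here runs at load time
        return (print, ("this ran during unpickling!",))

payload = pickle.dumps(Demo())
pickle.loads(payload)  # prints the message above
```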

Checkpointing long-running loops

Suppose you have a long loop and want to save progress periodically, so that if an error or crash occurs you can restart from the last checkpoint.

Pickle works perfectly for this because it can serialize almost any Python object.

Example 1 — Save Progress Every N Iterations

import pickle
import os

progress_file = "progress.pkl"

# Try to load previous progress
if os.path.exists(progress_file):
    with open(progress_file, "rb") as f:
        data = pickle.load(f)
        start_index = data["index"]
        results = data["results"]
    print(f"Resuming from iteration {start_index}...")
else:
    start_index = 0
    results = []

# Long loop
for i in range(start_index, 1000000):
    try:
        # --- Your computation here ---
        result = i ** 2  # Example

        results.append(result)

        # Checkpoint every 1000 iterations
        if i % 1000 == 0:
            with open(progress_file, "wb") as f:
                pickle.dump({"index": i + 1, "results": results}, f)
            print(f"Checkpoint saved at iteration {i}")

    except Exception as e:
        print("Error occurred:", e)
        break

What this does:

  • If the script crashes, you can re-run it.
  • It automatically loads the last saved checkpoint and resumes from there.
  • No need to redo previous work.

Example 2 — Save Progress After Each Iteration (for expensive tasks)

If each iteration takes minutes, save after every iteration:

with open("progress.pkl", "wb") as f:
    pickle.dump({"index": i + 1, "results": results}, f)

This is slower but guarantees minimal work lost.
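One caveat with frequent checkpointing: if the process dies in the middle of pickle.dump, the checkpoint file itself can be left truncated. A common defensive pattern (sketched here; save_checkpoint_atomic is an illustrative name, not from the examples above) writes to a temporary file and atomically renames it into place:

```python
import os
import pickle

def save_checkpoint_atomic(path, data):
    """Write the pickle to a temp file, then atomically swap it in."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp_path, path)  # atomic on the same filesystem

save_checkpoint_atomic("progress.pkl", {"index": 42, "results": [1, 4, 9]})
```

If a crash happens mid-write, only the .tmp file is corrupted; the last good checkpoint stays intact.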

Example 3 — Save Only the Last Completed Element

If results get too big, you can save just the last state:

state = {
    "i": i,
    "partial_result": partial_object,
}
with open("checkpoint.pkl", "wb") as f:
    pickle.dump(state, f)

Example 4 — Checkpointing in Case of Errors Only (Try/Except)

try:
    for i in range(start_index, N):
        do_work()
except Exception:
    print("Error detected, saving checkpoint...")
    with open("progress.pkl", "wb") as f:
        pickle.dump({"i": i, "results": results}, f)
    raise

Reloading After a Failure

with open("progress.pkl", "rb") as f:
    data = pickle.load(f)

start_i = data["i"]
results = data["results"]

Then restart your loop from start_i.

What to Save?

Depending on your workflow, you can pickle:

  • current loop index
  • accumulated results
  • a cache of processed items
  • model training checkpoints (if not too large)
  • downloaded files metadata
  • partially processed DataFrames
  • temporary arrays

Pickle is ideal for storing anything Python-native.
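Almost anything, but not quite: lambdas, open file handles, generators, and locks cannot be pickled. A quick check (the exact exception type varies by context):

```python
import pickle

# Lambdas (and other local/anonymous functions) are not picklable
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError, TypeError) as e:
    print("Cannot pickle a lambda:", e)
```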

Warnings

Don’t pickle massive NumPy arrays repeatedly

Pickling 10M-element arrays every minute is slow and produces big files.
Instead:

  • save them once in .npy or .parquet
  • use pickle only for metadata (indexes, filenames)
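A sketch of that split (file names here are illustrative): keep the heavy array in NumPy's native .npy format, written once, and pickle only the lightweight metadata:

```python
import pickle
import numpy as np

big_array = np.arange(1_000_000)

# Heavy data: NumPy's own binary format, written once
np.save("big_array.npy", big_array)

# Lightweight metadata: cheap to re-pickle at every checkpoint
meta = {"array_file": "big_array.npy", "last_index": 123_456}
with open("meta.pkl", "wb") as f:
    pickle.dump(meta, f)

# Later: reload both
restored = np.load("big_array.npy")
with open("meta.pkl", "rb") as f:
    meta = pickle.load(f)
```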

Don’t unpickle files you didn't create

Security risk.

Reusable Checkpointing System

Here is a clean, reusable, production-ready checkpointing system you can drop into any long loop.
This avoids repeating code and works for any Python objects.

1. Universal Checkpointing Utilities

You can put this in a file like checkpoint.py and import it anywhere.

import pickle
import os

def save_checkpoint(path, data):
    """Save any Python object to a checkpoint."""
    with open(path, "wb") as f:
        pickle.dump(data, f)


def load_checkpoint(path, default=None):
    """Load a checkpoint if available, otherwise return default."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def delete_checkpoint(path):
    """Remove a checkpoint file."""
    if os.path.exists(path):
        os.remove(path)

These three simple functions handle saving, loading, and resetting checkpoints.

2. Example: Using Checkpoints in a Long Loop

This template works for loops that may take hours or days.

from checkpoint import save_checkpoint, load_checkpoint

checkpoint_path = "progress.pkl"

# Try to load old progress
state = load_checkpoint(checkpoint_path, default={"i": 0, "results": []})

start_i = state["i"]
results = state["results"]

print(f"Resuming at iteration {start_i} with {len(results)} saved results.")

N = 1_000_000  # long loop

for i in range(start_i, N):
    try:
        # ----- Your actual work -----
        value = i ** 2  # example heavy computation
        results.append(value)
        # ----------------------------------------

        # Save checkpoint every 1000 iterations
        if i % 1000 == 0:
            save_checkpoint(checkpoint_path, {"i": i + 1, "results": results})
            print(f"Checkpoint saved at iteration {i}")

    except Exception as e:
        print("ERROR OCCURRED — saving checkpoint before quitting...")
        save_checkpoint(checkpoint_path, {"i": i, "results": results})
        raise  # re-raise so the full traceback is visible

3. If your script crashes:

Just run it again and it will automatically continue:

Resuming at iteration 237000 with 237000 saved results.

No lost work.

Example for Large Data (Better Memory Use)

If your results list becomes huge, save only metadata:

save_checkpoint(checkpoint_path, {"i": i})

and save the heavy data (arrays, dfs) separately (.npy, .parquet, .csv).

References

  • pickle — Python object serialization: the official Python documentation for the pickle module
  • Serialization and object persistence: the relevant section of the official Python Standard Library manual
  • Python Pickle Example: A Guide to Serialization & Persistence: a practical, tutorial-style guide from DigitalOcean