Introduction
In Python, a pickle file (.pkl or .pickle) is a way to serialize Python objects and save them to disk — that is, to turn Python objects (lists, dictionaries, models, DataFrames, etc.) into a byte stream that can later be deserialized (loaded back into memory exactly as they were).
What “Pickling” Means
- Pickling → converting a Python object into a byte stream (so it can be saved to a file or sent over a network).
- Unpickling → converting the byte stream back into the original Python object.
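The two operations can be tried entirely in memory with pickle.dumps and pickle.loads (a minimal sketch using made-up data):

```python
import pickle

obj = {'scores': [85, 92, 78], 'passed': True}

blob = pickle.dumps(obj)       # pickling: object -> byte stream
print(type(blob))              # <class 'bytes'>

restored = pickle.loads(blob)  # unpickling: byte stream -> object
print(restored == obj)         # True
```

pickle.dump/pickle.load are the same operations, just reading from and writing to a file instead of a bytes object.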
Why It’s Useful
You can use pickle to:
- Save trained machine learning models (e.g., scikit-learn, XGBoost).
- Store preprocessed data or intermediate results between runs.
- Cache complex Python objects (e.g., dictionaries, lists, pandas DataFrames).
Example Usage
Saving (Pickling) an Object
```python
import pickle

data = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Save to file
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)
```
'wb' means “write binary”.
Loading (Unpickling) the Object
```python
import pickle

# Load from file
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)
# {'name': 'Benjamin', 'scores': [85, 92, 78]}
```
'rb' means “read binary”.
Important Notes
- Security warning: Never unpickle data from untrusted sources, because pickle.load() can execute arbitrary Python code (it's not sandboxed).
- Compatibility: Pickle files can be Python-version dependent. A file pickled in Python 3.11 may not load cleanly in Python 3.9, depending on the objects and protocol used.
- Alternatives:
  - For safe and interoperable storage → use json, csv, or parquet.
  - For machine learning models → prefer the frameworks' own methods (joblib, torch.save, tf.keras.models.save_model).
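On the compatibility point, one common mitigation (my addition, a sketch) is to pin the pickle protocol explicitly instead of relying on the default, which changes between Python versions; protocol 4, for example, is readable by Python 3.4 and later:

```python
import pickle

data = {'name': 'Benjamin', 'scores': [85, 92, 78]}

# Pin an explicit protocol so older interpreters can still read the file;
# the default protocol varies by Python version (e.g. 5 since Python 3.8).
with open('data_compat.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=4)

with open('data_compat.pkl', 'rb') as f:
    print(pickle.load(f) == data)  # True
```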
Basic Examples of Pickle File Usage
Example 1: Pickling a pandas DataFrame
Let’s say you have a DataFrame you want to save and reload later:
```python
import pandas as pd
import pickle

# Create a DataFrame
df = pd.DataFrame({
    'city': ['New York', 'Paris', 'Tokyo'],
    'temperature': [22, 19, 25]
})

# --- Save (pickle) the DataFrame ---
with open('weather.pkl', 'wb') as f:
    pickle.dump(df, f)

# --- Load (unpickle) it back ---
with open('weather.pkl', 'rb') as f:
    df_loaded = pickle.load(f)

print(df_loaded)
```
Output:
```
       city  temperature
0  New York           22
1     Paris           19
2     Tokyo           25
```
Note: pandas also has its own convenience method:
```python
df.to_pickle('weather.pkl')
df_loaded = pd.read_pickle('weather.pkl')
```
That’s just a wrapper around the same pickle mechanism.
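One extra convenience of the pandas wrapper, for what it's worth: it handles compression for you, inferring the codec from the file extension (the filename here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'Paris', 'Tokyo'],
    'temperature': [22, 19, 25]
})

# pandas infers gzip compression from the .gz extension
df.to_pickle('weather.pkl.gz')
df_loaded = pd.read_pickle('weather.pkl.gz')

print(df_loaded.equals(df))  # True
```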
Example 2: Pickling a Trained Machine Learning Model
Suppose you trained a scikit-learn model and want to reuse it without retraining:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pickle

# Load example data
X, y = load_iris(return_X_y=True)

# Train model
model = RandomForestClassifier()
model.fit(X, y)

# --- Save (pickle) model ---
with open('rf_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# --- Load it back ---
with open('rf_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Verify it's working
print(loaded_model.predict([[5.1, 3.5, 1.4, 0.2]]))
```
Output example:
```
[0]
```
So your trained model is restored exactly as before — ready to predict again.
Tip: Use joblib for large models
joblib is similar to pickle but more efficient for big NumPy arrays:
```python
import joblib

joblib.dump(model, 'rf_model.joblib')
model_loaded = joblib.load('rf_model.joblib')
```
Pickle multiple objects together
Here’s the cleanest and safest way to pickle multiple objects together in a single file:
Put everything into one dictionary, and pickle the dictionary.
This ensures you keep:
- your model
- your scaler
- feature names
- any metadata (means, thresholds, timestamps, etc.)
all together in one .pkl file.
Example: Pickling Multiple Objects in One File
Suppose you have:
- a trained model
- a StandardScaler
- a list of feature names
- some metadata
Here’s the full code:
```python
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Fake training for example
X = np.random.rand(100, 3)
y = np.random.rand(100)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = RandomForestRegressor()
model.fit(X_scaled, y)

feature_names = ['feat1', 'feat2', 'feat3']
metadata = {'version': '1.0', 'created_by': 'Benjamin'}

# Pack everything in a dictionary
bundle = {
    'model': model,
    'scaler': scaler,
    'features': feature_names,
    'metadata': metadata
}

# --- Save the whole bundle ---
with open('model_bundle.pkl', 'wb') as f:
    pickle.dump(bundle, f)
```
Loading Everything Back
```python
import pickle

# --- Load the bundle ---
with open('model_bundle.pkl', 'rb') as f:
    bundle = pickle.load(f)

model = bundle['model']
scaler = bundle['scaler']
features = bundle['features']
metadata = bundle['metadata']

print(metadata)
print(features)
```
Using the Loaded Model
```python
# Example new data point
new_data = np.array([[0.2, 0.5, 0.9]])

# Apply scaler (same scaler used before)
new_data_scaled = scaler.transform(new_data)

# Predict
pred = model.predict(new_data_scaled)
print(pred)
```
Best Practice
Bundle objects in a dictionary → version-controlled, easy to inspect.
Can also bundle:
- encoders (LabelEncoder, OneHotEncoder)
- imputers
- PCA or UMAP transformers
- thresholds for classification
- hyperparameters
- training metrics
Avoid pickling from untrusted sources
Security issue: pickle can run arbitrary code.
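If you must load pickled data whose origin you don't fully control, the official pickle documentation shows a restricting pattern: subclass pickle.Unpickler and override find_class to whitelist what may be loaded. A sketch (the whitelist here is only illustrative, and this reduces — not eliminates — the risk):

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {"list", "dict", "set", "tuple", "frozenset", "range"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a small whitelist of harmless builtins
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"Forbidden global: {module}.{name}")

def restricted_loads(data):
    """Like pickle.loads, but refuses non-whitelisted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data structures load normally...
print(restricted_loads(pickle.dumps({"scores": [85, 92, 78]})))

# ...but anything referencing other classes/functions is rejected.
import collections
try:
    restricted_loads(pickle.dumps(collections.OrderedDict()))
except pickle.UnpicklingError as e:
    print("Blocked:", e)
```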
Checkpointing long-running loops
Suppose you have a long-running loop and want to save progress periodically, so that if an error or crash happens you can restart from the last checkpoint.
Pickle works perfectly for this because it can serialize almost any Python object.
Example 1 — Save Progress Every N Iterations
```python
import pickle
import os

progress_file = "progress.pkl"

# Try to load previous progress
if os.path.exists(progress_file):
    with open(progress_file, "rb") as f:
        data = pickle.load(f)
    start_index = data["index"]
    results = data["results"]
    print(f"Resuming from iteration {start_index}...")
else:
    start_index = 0
    results = []

# Long loop
for i in range(start_index, 1000000):
    try:
        # --- Your computation here ---
        result = i ** 2  # Example
        results.append(result)

        # Checkpoint every 1000 iterations
        if i % 1000 == 0:
            with open(progress_file, "wb") as f:
                pickle.dump({"index": i + 1, "results": results}, f)
            print(f"Checkpoint saved at iteration {i}")

    except Exception as e:
        print("Error occurred:", e)
        break
```
What this does:
- If the script crashes, you can re-run it.
- It automatically loads the last saved checkpoint and resumes from there.
- No need to redo previous work.
Example 2 — Save Progress After Each Iteration (for expensive tasks)
If each iteration takes minutes, save after every iteration:
```python
with open("progress.pkl", "wb") as f:
    pickle.dump({"index": i + 1, "results": results}, f)
```
This is slower but guarantees minimal work lost.
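A related caveat (my addition, not in the original examples): if the process crashes in the middle of pickle.dump, the checkpoint file itself can be left half-written and unreadable. A common fix is to write to a temporary file and atomically swap it into place:

```python
import os
import pickle

def save_checkpoint_atomic(path, data):
    """Write to a temp file first, then atomically replace the real one,
    so a crash mid-write never corrupts the existing checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX and Windows

# Usage inside the loop (names as in the examples above):
# save_checkpoint_atomic("progress.pkl", {"index": i + 1, "results": results})
```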
Example 3 — Save Only the Last Completed Element
If results get too big, you can save just the last state:
```python
state = {
    "i": i,
    "partial_result": partial_object,  # whatever partial state you need
}
with open("checkpoint.pkl", "wb") as f:
    pickle.dump(state, f)
```
Example 4 — Checkpointing in Case of Errors Only (Try/Except)
```python
try:
    for i in range(start_index, N):
        do_work()
except Exception:
    print("Error detected, saving checkpoint...")
    with open("progress.pkl", "wb") as f:
        pickle.dump({"i": i, "results": results}, f)
    raise
```
Reloading After a Failure
```python
with open("progress.pkl", "rb") as f:
    data = pickle.load(f)

start_i = data["i"]
results = data["results"]
```
Then restart your loop from start_i.
What to Save?
Depending on your workflow, you can pickle:
- current loop index
- accumulated results
- a cache of processed items
- model training checkpoints (if not too large)
- downloaded files metadata
- partially processed DataFrames
- temporary arrays
Pickle is ideal for storing anything Python-native.
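"Almost any" is the operative phrase — lambdas, open file handles, database connections, and some locks cannot be pickled. A small helper (hypothetical, for illustration) makes it easy to check an object before relying on it in a checkpoint:

```python
import pickle

def picklable(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

print(picklable({'a': [1, 2]}))    # True
print(picklable(lambda x: x + 1))  # False — lambdas can't be pickled
```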
Warnings
Don’t pickle massive NumPy arrays repeatedly
Pickling 10M-element arrays every minute is slow and produces big files.
Instead:
- save them once in .npy or .parquet
- use pickle only for metadata (indexes, filenames)
Don’t unpickle files you didn't create
Security risk.
Reusable Checkpointing System
Here is a clean, reusable, production-ready checkpointing system you can drop into any long loop.
This avoids repeating code and works for any Python objects.
1. Universal Checkpointing Utilities
You can put this in a file like checkpoint.py and import it anywhere.
```python
import pickle
import os

def save_checkpoint(path, data):
    """Save any Python object to a checkpoint."""
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_checkpoint(path, default=None):
    """Load a checkpoint if available, otherwise return default."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default

def delete_checkpoint(path):
    """Remove a checkpoint file."""
    if os.path.exists(path):
        os.remove(path)
```
These three simple functions handle saving, loading, and resetting checkpoints.
2. Example: Using Checkpoints in a Long Loop
This template works for loops that may take hours or days.
```python
from checkpoint import save_checkpoint, load_checkpoint

checkpoint_path = "progress.pkl"

# Try to load old progress
state = load_checkpoint(checkpoint_path, default={"i": 0, "results": []})
start_i = state["i"]
results = state["results"]

print(f"Resuming at iteration {start_i} with {len(results)} saved results.")

N = 1_000_000  # long loop

for i in range(start_i, N):
    try:
        # ----- Your actual work -----
        value = i ** 2  # example heavy computation
        results.append(value)
        # ----------------------------

        # Save checkpoint every 1000 iterations
        if i % 1000 == 0:
            save_checkpoint(checkpoint_path, {"i": i + 1, "results": results})
            print(f"Checkpoint saved at iteration {i}")

    except Exception as e:
        print("ERROR OCCURRED — saving checkpoint before quitting...")
        save_checkpoint(checkpoint_path, {"i": i, "results": results})
        raise e  # re-throw for debugging
```
3. If your script crashes:
Just run it again and it will automatically continue:
```
Resuming at iteration 237000 with 237000 saved results.
```
No lost work.
Example for Large Data (better memory)
If your results list becomes huge, save only metadata:
```python
save_checkpoint(checkpoint_path, {"i": i})
```
and save the heavy data (arrays, dfs) separately (.npy, .parquet, .csv).
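A sketch of that split (filenames are made up): the heavy array goes to .npy, and the pickle checkpoint holds only the pointer to it:

```python
import pickle
import numpy as np

# Hypothetical heavy array produced during the loop
big_array = np.random.rand(1_000_000)

# Heavy data goes to an efficient binary format...
np.save("chunk_0042.npy", big_array)

# ...while the pickle checkpoint stores only lightweight metadata
with open("checkpoint.pkl", "wb") as f:
    pickle.dump({"i": 42, "array_file": "chunk_0042.npy"}, f)

# On resume: read the metadata first, then load the array it points to
with open("checkpoint.pkl", "rb") as f:
    meta = pickle.load(f)
restored = np.load(meta["array_file"])
print(restored.shape)  # (1000000,)
```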
References
| Link | Description |
|---|---|
| pickle — Python object serialization | Official Python documentation for the pickle module |
| Serialization — Python Guide on pickling and object persistence | Official section of the Python Standard Library manual on persistence (including pickle) |
| Python Pickle Example: A Guide to Serialization & Persistence (DigitalOcean) | Practical, tutorial-style explanation of how to use pickle for saving/restoring Python objects |