What is CatBoost?
CatBoost is an open-source gradient boosting library developed by Yandex, designed to handle categorical features efficiently [1]. Unlike traditional gradient boosting methods, CatBoost processes categorical data directly without extensive preprocessing, making it particularly effective for datasets with categorical variables [2].
Key features of CatBoost include:
- Native Handling of Categorical Features: CatBoost can process non-numeric data directly, eliminating the need for manual encoding.
- Ordered Boosting: This technique reduces overfitting by using a permutation-driven approach, which tends to yield more accurate predictions [3].
- Symmetric Trees: CatBoost builds balanced (oblivious) trees, which improves training speed and model efficiency [4].
- GPU Support: The library supports GPU acceleration, allowing for faster training on large datasets.
CatBoost is versatile and has been applied across various industries. For instance, JetBrains utilizes it for code completion, Cloudflare employs it for bot detection, and Careem uses it to predict future ride destinations [5].
Overall, CatBoost offers a robust solution for machine learning tasks, especially when dealing with categorical data, providing high accuracy and efficiency with minimal data preprocessing.
How does it operate?
CatBoost operates within a gradient boosting framework, similar to algorithms like LightGBM and XGBoost. It constructs multiple decision trees sequentially, where each subsequent tree is designed to correct the errors of its predecessors. This iterative process improves the model’s accuracy over time. The final prediction is obtained by aggregating the weighted outputs of all individual trees [6].
A distinctive feature of CatBoost is its ability to handle categorical features natively, eliminating the need for extensive preprocessing. This is achieved through a technique known as ordered boosting, which mitigates overfitting by preventing target leakage during training.
Additionally, CatBoost employs symmetric trees, where splits occur simultaneously across all nodes at each level of the decision tree. This structure facilitates faster prediction times and reduces the risk of overfitting.
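To make the native categorical handling concrete, here is a minimal sketch with an invented toy dataset (the column names and values are illustrative only, not taken from this article): raw string columns are passed straight to the classifier via cat_features, with no one-hot or label encoding.
from catboost import CatBoostClassifier
import pandas as pd
# Toy data: one numeric feature and one raw categorical (string) feature
X = pd.DataFrame({
    "price": [1.05, 1.07, 1.02, 1.10, 1.04, 1.08],
    "day_of_week": ["Mon", "Tue", "Wed", "Mon", "Fri", "Tue"],
})
y = [0, 1, 0, 1, 0, 1]
# cat_features tells CatBoost which columns to treat as categorical
model = CatBoostClassifier(iterations=50, depth=4, verbose=0)
model.fit(X, y, cat_features=["day_of_week"])
print(model.predict(X[:2]))
The same pattern is used later in this article, where the calendar features are converted to strings and passed to the classifier through cat_features.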
CatBoost vs XGBoost and LightGBM
Aspect | CatBoost | LightGBM | XGBoost |
---|---|---|---|
Categorical features handling | Handles categorical variables natively via automatic encoding and ordered boosting. | Requires manual encoding such as one-hot or label encoding. | Requires manual encoding such as one-hot or label encoding. |
Decision Tree Structure | Builds symmetric (oblivious) trees that are balanced and grow evenly, giving faster predictions and a lower risk of overfitting. | Uses a leaf-wise (asymmetric) growth strategy that expands the leaves with the highest loss. This yields deep, imbalanced trees that can be more accurate but carry a greater risk of overfitting. | Uses a level-wise growth strategy that expands the tree based on the best split at each node. This is flexible but gives slower predictions and some risk of overfitting. |
Model accuracy | Good accuracy on datasets with many categorical features, thanks to ordered boosting and a reduced risk of overfitting on smaller data. | Good accuracy, particularly on large and high-dimensional datasets, since the leaf-wise growth strategy concentrates on regions with high error. | Good accuracy on most datasets, but tends to be outperformed by CatBoost on categorical data and by LightGBM on very large datasets due to its less aggressive tree-growing strategy. |
Training speed | Usually slower to train than LightGBM but efficient on small to medium datasets, especially when categorical features are involved. | Usually the fastest of the three, especially on large datasets, thanks to its leaf-wise tree growth. | Often the slowest of the three by a small margin, though still efficient on large datasets. |
Deploying the CatBoost Model
To implement a CatBoost model for predicting trading signals (Buy/Sell), we first need to define our problem scenario. Our dataset comprises continuous features—Open, High, Low, and Close prices (OHLC)—and categorical features: Day (day of the month), Day of the Week (Monday to Sunday), Day of the Year (1 to 366), Month (January to December), Hour (0 to 23), and Minute (0 to 59).
The dataset “EURUSD_H1.csv” is pulled in real time using Python and MetaTrader 5, and you can find a downloadable copy in the Files section at the end of the article.
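The download step itself isn’t shown in this part; as a rough sketch, producing such a CSV with the MetaTrader5 Python package (Windows only) could look like the following. The exact export script and column names are assumptions based on what the article’s CSV appears to contain.
# Rough sketch: download EURUSD H1 bars from a running MetaTrader 5 terminal and save to CSV
import MetaTrader5 as mt5
import pandas as pd
mt5.initialize()                                                        # connect to the MT5 terminal
rates = mt5.copy_rates_from_pos("EURUSD", mt5.TIMEFRAME_H1, 0, 10000)   # last 10 000 H1 bars
mt5.shutdown()
bars = pd.DataFrame(rates)
bars["time"] = pd.to_datetime(bars["time"], unit="s")                   # epoch seconds -> timestamps
bars = bars.rename(columns={"time": "Time", "open": "Open", "high": "High", "low": "Low",
                            "close": "Close", "tick_volume": "Tick Volume",
                            "spread": "Spread", "real_volume": "Real Volume"})
bars.set_index("Time").to_csv("EURUSD_H1.csv")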
Install and import necessary libraries
We start by installing CatBoost and importing the libraries we need.
pip install catboost
import sys
import numpy as np
import pandas as pd
from stockstats import wrap # https://github.com/jealous/stockstats
import catboost
from catboost import CatBoostClassifier
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import csv
sns.set_style("darkgrid")
forex_file = "/mnt/c/Users/matsl/Python/Forex/EURUSD_H1.csv"
onnx_file = "/mnt/c/Users/matsl/AppData/Roaming/MetaQuotes/Terminal/930119AA53207C87xxxxxxxxxxxxxxxx/MQL5/Files/CatBoost.EURUSD.OHLC.H1.onnx"
I’m using the Stockstats wrapper by Cedric Zhuang [7] for handy inline stock statistics/indicator support.
print('Python version:{}'.format(sys.version))
print('Numpy version:{}'.format(np.__version__))
print('Pandas version:{}'.format(pd.__version__))
print('CatBoost version:{}'.format(catboost.__version__))
print('Sci-Kit Learn version:{}'.format(sklearn.__version__))
Python version:3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0]
Numpy version:1.26.4
Pandas version:2.2.3
CatBoost version:1.2.7
Sci-Kit Learn version:1.5.2
Importing the data
We read the CSV file and modify the dataset to suit our needs.
eurusd_h1 = pd.read_csv(forex_file, index_col=0, parse_dates=True, skipinitialspace=True)
# Rename the columns to the lowercase names that stockstats expects
eurusd_h1.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low',
                          'Close': 'close', 'Tick Volume': 'volume'}, inplace=True)
# Drop the columns we won't use
eurusd_h1.drop(columns=['Spread', 'Real Volume'], inplace=True)
df = pd.DataFrame(eurusd_h1)
df = wrap(df)  # wrap with stockstats for on-demand indicator columns
df.head()
open high low close volume
Time
2023-09-29 07:00:00 1.05755 1.05814 1.05751 1.05801 457
2023-09-29 08:00:00 1.05801 1.05861 1.05757 1.05765 1004
2023-09-29 09:00:00 1.05766 1.05920 1.05743 1.05901 3349
2023-09-29 10:00:00 1.05899 1.06107 1.05881 1.06064 3762
2023-09-29 11:00:00 1.06066 1.06166 1.05994 1.06095 3847
# Add technical indicators: accessing these columns makes stockstats compute
# them on demand and append them to the dataframe
df['rsi'] = df['rsi']
df['stochrsi'] = df['stochrsi']
df['atr'] = df['atr']
# Add date and time features
df['day'] = df.index.day
df['day_of_week'] = df.index.dayofweek
df['day_of_year'] = df.index.dayofyear
df['month'] = df.index.month
df['hour'] = df.index.hour
df['minute'] = df.index.minute
# Tidy up the dataframe
df.dropna(inplace=True)
# Ensure all category columns are integers
df = df.astype({'day': 'int', 'day_of_week': 'int', 'day_of_year': 'int', 'month': 'int', 'hour': 'int', 'minute': 'int'})
df.head()
open high low close volume rsi stochrsi atr day day_of_week day_of_year month hour minute
Time
2023-09-29 09:00:00 1.05766 1.05920 1.05743 1.05901 3349 80.269815 100.000000 0.001175 29 4 272 9 9 0
2023-09-29 10:00:00 1.05899 1.06107 1.05881 1.06064 3762 90.309633 100.000000 0.001477 29 4 272 9 10 0
2023-09-29 11:00:00 1.06066 1.06166 1.05994 1.06095 3847 91.224247 100.000000 0.001533 29 4 272 9 11 0
2023-09-29 12:00:00 1.06082 1.06171 1.06040 1.06144 2966 92.439019 100.000000 0.001489 29 4 272 9 12 0
2023-09-29 13:00:00 1.06144 1.06164 1.06066 1.06091 2274 79.603662 86.114784 0.001399 29 4 272 9 13 0
Creating the signals for training
We shift the ‘close’ and ‘open’ columns by one row to get the future close and open price values, then we add these new columns to the dataset.
new_df = df.copy()
new_df["target_close"] = df["close"].shift(-1)
new_df["target_open"] = df["open"].shift(-1)
new_df = new_df.dropna()
new_df = new_df.reset_index(drop=True)
open_values = new_df["target_open"]
close_values = new_df["target_close"]
target = []
for i in range(len(open_values)):
if close_values[i] > open_values[i]:
target.append(1)
else:
target.append(0)
new_df["signal"] = target
print(new_df.shape)
new_df.head()
(8112, 17)
open high low close volume rsi stochrsi atr day day_of_week day_of_year month hour minute target_close target_open signal
0 1.05766 1.05920 1.05743 1.05901 3349 80.269815 100.000000 0.001175 29 4 272 9 9 0 1.06064 1.05899 1
1 1.05899 1.06107 1.05881 1.06064 3762 90.309633 100.000000 0.001477 29 4 272 9 10 0 1.06095 1.06066 1
2 1.06066 1.06166 1.05994 1.06095 3847 91.224247 100.000000 0.001533 29 4 272 9 11 0 1.06144 1.06082 1
3 1.06082 1.06171 1.06040 1.06144 2966 92.439019 100.000000 0.001489 29 4 272 9 12 0 1.06091 1.06144 0
4 1.06144 1.06164 1.06066 1.06091 2274 79.603662 86.114784 0.001399 29 4 272 9 13 0 1.05943 1.06092 0
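Incidentally, the loop above can be collapsed into a single vectorized expression. The following line is my own shorthand with the same intent, not code from the original article:
# 1 if the next candle closes above its open (bullish), otherwise 0
new_df["signal"] = (new_df["target_close"] > new_df["target_open"]).astype(int)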
Splitting the data
With the signals ready for prediction, let’s split the data into training and testing samples.
X = new_df.drop(columns = ["target_close", "target_open", "signal"]) # we drop future values
y = new_df["signal"] # trading signals are the target variable we want to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
We create a list of categorical features available in our dataset.
categorical_features = ["day","day_of_week", "day_of_year", "month", "hour", "minute"]
We can then use this list to convert the categorical features to strings, so that CatBoost treats them as discrete categories rather than numeric values.
X_train[categorical_features] = X_train[categorical_features].astype(str)
X_test[categorical_features] = X_test[categorical_features].astype(str)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6489 entries, 7493 to 7270
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 6489 non-null float64
1 high 6489 non-null float64
2 low 6489 non-null float64
3 close 6489 non-null float64
4 volume 6489 non-null int64
5 rsi 6489 non-null float64
6 stochrsi 6489 non-null float64
7 atr 6489 non-null float64
8 day 6489 non-null object
9 day_of_week 6489 non-null object
10 day_of_year 6489 non-null object
11 month 6489 non-null object
12 hour 6489 non-null object
13 minute 6489 non-null object
dtypes: float64(7), int64(1), object(6)
memory usage: 760.4+ KB
Training a CatBoost Model
Before using the fit method to train the CatBoost model, let’s explore some of the key parameters [8] that drive its functionality.
Parameter | Description |
---|---|
iterations | The number of boosting iterations (trees) to build. More iterations can improve performance but also increase the risk of overfitting. |
learning_rate | Controls the contribution of each tree to the final prediction. A smaller learning rate needs more iterations to converge but often yields a better model. |
depth | The maximum depth of the trees. Deeper trees can capture more complex patterns but are more prone to overfitting. |
cat_features | A list of the categorical feature indices (or names). Even though CatBoost can sometimes infer categorical features, it is good practice to declare them explicitly, since automatic detection can fail. |
l2_leaf_reg | The L2 regularization coefficient. It helps prevent overfitting by penalizing large leaf weights. |
border_count | The number of splits (borders) used when quantizing numerical features. Higher values can improve accuracy but increase training time. |
eval_metric | The evaluation metric used on the validation data during training, which makes it easy to monitor model performance. |
early_stopping_rounds | When validation data is provided, training stops if the evaluation metric does not improve for this many rounds. This helps reduce overfitting and can save a lot of training time. |
Let’s define a dictionary for the above parameters.
params = dict(
task_type='GPU',
iterations=300, # Number of boosting iterations
#learning_rate=0.01, # The automatically defined value should be close to the optimal one.
depth=12, # Depth of the tree
feature_border_type='MinEntropy', # Method used to choose borders when quantizing numerical features
l2_leaf_reg=12, # L2 regularization coefficient
loss_function='Logloss', # Loss function to be optimized
bagging_temperature=1, # Controls intensity of Bayesian bagging
border_count=128, # Number of splits for numerical features
eval_metric='Logloss', # Metrics for validation data
random_seed=42, # Seed for reproducibility
verbose=1, # Verbosity level
early_stopping_rounds=10 # Early stopping for validation
)
Finally, we define the CatBoost model within the Sklearn pipeline and call the fit method to train it, providing evaluation data and a list of categorical features.
pipe = Pipeline([
("catboost", CatBoostClassifier(**params))
])
# Fit the pipeline to the training data
pipe.fit(X_train, y_train, catboost__eval_set=(X_test, y_test), catboost__cat_features=categorical_features)
0: learn: 0.6906250 test: 0.6928095 best: 0.6928095 (0) total: 240ms remaining: 1m 11s
1: learn: 0.6884363 test: 0.6929242 best: 0.6928095 (0) total: 463ms remaining: 1m 8s
2: learn: 0.6858044 test: 0.6926241 best: 0.6926241 (2) total: 693ms remaining: 1m 8s
3: learn: 0.6840154 test: 0.6923136 best: 0.6923136 (3) total: 915ms remaining: 1m 7s
4: learn: 0.6821880 test: 0.6922213 best: 0.6922213 (4) total: 1.14s remaining: 1m 7s
5: learn: 0.6800458 test: 0.6919058 best: 0.6919058 (5) total: 1.38s remaining: 1m 7s
6: learn: 0.6787011 test: 0.6914997 best: 0.6914997 (6) total: 1.59s remaining: 1m 6s
7: learn: 0.6766485 test: 0.6915836 best: 0.6914997 (6) total: 1.83s remaining: 1m 6s
8: learn: 0.6752735 test: 0.6913356 best: 0.6913356 (8) total: 2.05s remaining: 1m 6s
9: learn: 0.6728656 test: 0.6910021 best: 0.6910021 (9) total: 2.27s remaining: 1m 5s
10: learn: 0.6712232 test: 0.6906219 best: 0.6906219 (10) total: 2.49s remaining: 1m 5s
11: learn: 0.6688273 test: 0.6903468 best: 0.6903468 (11) total: 2.71s remaining: 1m 4s
12: learn: 0.6660441 test: 0.6905106 best: 0.6903468 (11) total: 2.94s remaining: 1m 4s
13: learn: 0.6645466 test: 0.6903318 best: 0.6903318 (13) total: 3.16s remaining: 1m 4s
14: learn: 0.6629403 test: 0.6906848 best: 0.6903318 (13) total: 3.37s remaining: 1m 4s
15: learn: 0.6604999 test: 0.6903479 best: 0.6903318 (13) total: 3.61s remaining: 1m 4s
16: learn: 0.6587366 test: 0.6902117 best: 0.6902117 (16) total: 3.82s remaining: 1m 3s
17: learn: 0.6576030 test: 0.6901320 best: 0.6901320 (17) total: 4.04s remaining: 1m 3s
18: learn: 0.6554191 test: 0.6898560 best: 0.6898560 (18) total: 4.28s remaining: 1m 3s
19: learn: 0.6543412 test: 0.6895462 best: 0.6895462 (19) total: 4.51s remaining: 1m 3s
20: learn: 0.6517582 test: 0.6893069 best: 0.6893069 (20) total: 4.75s remaining: 1m 3s
21: learn: 0.6500230 test: 0.6891939 best: 0.6891939 (21) total: 4.98s remaining: 1m 2s
22: learn: 0.6475220 test: 0.6890455 best: 0.6890455 (22) total: 5.21s remaining: 1m 2s
23: learn: 0.6465565 test: 0.6887497 best: 0.6887497 (23) total: 5.46s remaining: 1m 2s
24: learn: 0.6440090 test: 0.6885688 best: 0.6885688 (24) total: 5.7s remaining: 1m 2s
...
80: learn: 0.5550032 test: 0.6859373 best: 0.6854023 (70) total: 18.3s remaining: 49.6s
bestTest = 0.6854023161
bestIteration = 70
Shrink model to first 71 iterations.
Evaluating the Model
We can evaluate the model’s performance using Sklearn’s metrics.
# Make predictions on training and testing sets
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)
# Training set evaluation
print("Training Set Classification Report:")
print(classification_report(y_train, y_train_pred))
# Testing set evaluation
print("\nTesting Set Classification Report:")
print(classification_report(y_test, y_test_pred))
Training Set Classification Report:
precision recall f1-score support
0 0.70 0.73 0.72 3162
1 0.74 0.70 0.72 3327
accuracy 0.72 6489
macro avg 0.72 0.72 0.72 6489
weighted avg 0.72 0.72 0.72 6489
Testing Set Classification Report:
precision recall f1-score support
0 0.53 0.52 0.52 805
1 0.53 0.54 0.54 818
accuracy 0.53 1623
macro avg 0.53 0.53 0.53 1623
weighted avg 0.53 0.53 0.53 1623
To gain a deeper understanding of the model, let’s create a feature importance plot.
# Extract the trained CatBoostClassifier from the pipeline
catboost_model = pipe.named_steps['catboost']
# Get feature importances
feature_importances = catboost_model.get_feature_importance()
feature_im_df = pd.DataFrame({
"feature": X.columns,
"importance": feature_importances
})
feature_im_df = feature_im_df.sort_values(by="importance", ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(data = feature_im_df, x='importance', y='feature', palette="viridis")
plt.title("CatBoost feature importance")
plt.xlabel("Importance")
plt.ylabel("feature")
plt.show()
The feature importance plot above provides a clear overview of how the model made its decisions. It appears that the CatBoost model prioritized the categorical variables as the most influential features in determining the final predictions, more so than the continuous variables.
In Part 2, I will demonstrate how to save the model in ONNX format and integrate it into an Expert Advisor in MetaTrader 5 for automated trading powered by the CatBoost AI.
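As a rough preview of Part 2, exporting the trained classifier to ONNX could look like the sketch below. This is an assumption about the workflow, not the article’s final code, and note that CatBoost’s ONNX export has historically been limited to models without categorical features, so Part 2 may need to adjust how the calendar features are fed to the model.
# Pull the trained CatBoost model out of the sklearn pipeline
catboost_model = pipe.named_steps['catboost']
# Save it to the MetaTrader 5 Files folder in ONNX format (path defined earlier as onnx_file).
# Note: CatBoost's ONNX exporter may reject models trained with categorical features.
catboost_model.save_model(onnx_file, format="onnx")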
Files
References
1. CatBoost – open-source gradient boosting library
2. Introduktion till CatBoost (Introduction to CatBoost)
3. CatBoost: unbiased boosting with categorical features
4. CatBoost in Machine Learning: A Detailed Guide
5. CatBoost – Wikipedia
6. What Is CatBoost? (Definition, How Does It Work?) | Built In
7. Stock Statistics/Indicators Calculation Helper
8. Common parameters – CatBoost