A different approach to sparse events
Many data scientists treat events as dummy variables. In asset management, it is very common to include encoded risky events in financial models, including financial crises, wars, economic releases or earnings calls. Usually, the selected encoding is the simplest one: a dummy variable represented by a vector with values in {0, 1}, where 0 means absence and 1 means presence.
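As a minimal illustration (with purely hypothetical dates), such an encoding can be built in pandas as follows:
import pandas as pd
# Hypothetical daily index and event dates (e.g. economic releases)
dates = pd.date_range("2024-01-01", "2024-01-31", freq="D")
event_dates = pd.to_datetime(["2024-01-05", "2024-01-19"])
# Dummy variable: 1 on event days, 0 otherwise
event_dummy = pd.Series(0, index=dates)
event_dummy[event_dummy.index.isin(event_dates)] = 1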
But for machine learning and Deep Learning models, this is generally insufficient. Regression models handle dummy variables properly, splitting the data into separate groups (so that “0” acts as a label, not a number), whereas Deep Learning models would treat them as numerical values rather than categories.
In some cases, models might still extract meaningful information from dummy variables. If the other variables fall in a comparable range (after preprocessing, if necessary), the model might produce meaningful coefficients, representing the impact of each state of the event (for instance, presence or absence).
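As a rough sketch of this idea on synthetic data (all names and numbers below are purely illustrative), the coefficient on the dummy in a least-squares fit captures the average shift associated with the event:
import numpy as np
rng = np.random.default_rng(0)
event_dummy = (rng.random(250) < 0.05).astype(float)  # sparse event indicator
returns = rng.normal(0.0, 1.0, 250) - 0.8 * event_dummy  # events shift returns down by 0.8
# Least-squares fit of returns on an intercept and the dummy
X = np.column_stack([np.ones_like(event_dummy), event_dummy])
coef, *_ = np.linalg.lstsq(X, returns, rcond=None)
print("Estimated event impact:", coef[1])  # should be close to -0.8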
However, this is usually not the optimal solution. Events should define regimes, or conditions, rather than variables. So, what viable alternatives exist for quants and data scientists? This article will explore two potential approaches.
Option 1: Embedding Layers
One limitation of many Deep Learning models, at least of the simplest ones, is that they can only consume numerical variables. By construction, Deep Learning models are built from a combination of layers and functions, allowing for non-linear transformations and more complex dynamics than a single coefficient could ever capture. This does not come without setbacks, such as the difficulty of handling categorical and dummy variables.
But this is where Embedding Layers can help. They are a more robust way of treating categorical variables, especially when the numerical code does not reflect any ordered structure, and conceptually they constitute a sort of lookup table.
During training and inference, each category essentially looks up a different vector in the Embedding Layer. This ensures a complete segregation of the weights used for each state in the DL model.
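A minimal sketch of this lookup behaviour (the vectors are random at initialisation and only become meaningful after training):
import tensorflow as tf
# Three categories, each mapped to a 2-dimensional trainable vector
lookup = tf.keras.layers.Embedding(input_dim=3, output_dim=2)
print(lookup(tf.constant([0, 1, 2])))  # returns one dense vector per category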
There are multiple advantages to this approach:
- Learned representations of states: the model learns the optimal weights for each category separately, leading to ad hoc optimizations;
- Dimensionality reduction: instead of sparse one-hot vectors, the embedding layer generates much lower-dimensional dense vectors. This is particularly helpful for high-cardinality categorical features;
- Interpretable relationships between categories: the learned vectors can help in analysing the distance between states, allowing for a clearer understanding of how different they are from each other;
- Avoids ordinal assumptions: plain integer (label) encoding implies an artificial ordering of the categories. Embedding Layers remove this issue entirely.
As with any approach, it does come with disadvantages as well, such as:
- Determining embedding dimensions (hyperparameter tuning): there is no clear answer to how large the vector associated with each state should be. This adds a parameter to fine-tune when training and validating the model (see the short sketch after this list);
- High-cardinality categorical variables: if the data presents a large number of categories, the underrepresented ones might suffer from underfitting, leading to poor results. This is a broad concern in regression analysis as well, and it is also pertinent to models with state-space representations, where the dimensionality introduced by such variables can impact parameter estimation and state inference;
- Interpretability: while comparing states through their vectors is relatively straightforward, the direct interpretation of the individual weights in each vector can be more complicated, especially in large models.
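There is no universal recipe for the embedding size; one commonly cited rule of thumb, used here purely as an illustrative starting point rather than a rule, is to take roughly the fourth root of the number of categories, capped at some maximum, and then tune it via validation:
# Heuristic starting point for the embedding dimension (an assumption, not a rule)
def suggest_embedding_dim(num_categories, max_dim=50):
    return min(max_dim, max(2, round(num_categories ** 0.25)))
print(suggest_embedding_dim(3))     # low-cardinality feature
print(suggest_embedding_dim(5000))  # high-cardinality feature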
Overall, Embedding Layers can greatly help in handling categories in Deep Learning. A sample implementation is shown below:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# Example Data
# Categorical feature: Color (Red, Blue, Green)
# Numerical feature: Size
# Target: Price
# Raw categorical data (string labels)
colors_raw = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red']
sizes = np.array([10, 12, 8, 15, 11, 9, 13])
prices = np.array([100, 120, 85, 140, 115, 90, 130])
# 1. Label Encode the categorical feature
unique_colors = sorted(list(set(colors_raw)))
color_to_int = {color: i for i, color in enumerate(unique_colors)}
encoded_colors = np.array([color_to_int[color] for color in colors_raw])
num_colors = len(unique_colors) # Number of unique categories
embedding_dim = 2 # Choose a suitable embedding dimension
# Define the deep learning model with an embedding layer
# Input for categorical feature (integer-encoded)
categorical_input = Input(shape=(1,), name='color_input')
# Embedding layer: maps integers to dense vectors
# input_dim: number of unique categories
# output_dim: dimension of the embedding vector
# name the layer explicitly so it can be retrieved later
embedding_layer = Embedding(input_dim=num_colors, output_dim=embedding_dim, name='color_embedding')(categorical_input)
# Flatten the embedding output for dense layers
flatten_embedding = Flatten()(embedding_layer)
# Input for numerical feature
numerical_input = Input(shape=(1,), name='size_input')
# Concatenate the embedding output and numerical input
merged_features = concatenate([flatten_embedding, numerical_input])
# Add dense layers
hidden_layer = Dense(10, activation='relu')(merged_features)
output_layer = Dense(1, activation='linear')(hidden_layer) # For regression task
model = Model(inputs=[categorical_input, numerical_input], outputs=output_layer)
model.compile(optimizer='adam', loss='mse')
model.summary()
# Train the model
model.fit(
{'color_input': encoded_colors, 'size_input': sizes},
prices,
epochs=50,
batch_size=1,
verbose=0
)
# Predict
predictions = model.predict({'color_input': encoded_colors, 'size_input': sizes})
print("\nOriginal Prices:", prices)
print("Predicted Prices:", predictions.flatten())
# You can inspect the learned embeddings
color_embeddings = model.get_layer('color_embedding').get_weights()[0]
print("\nLearned Embeddings for Colors:")
for i, color in enumerate(unique_colors):
    print(f"{color}: {color_embeddings[i]}")
# You could also calculate distances between states, as shown below.
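Building on the script above, a minimal sketch of such a comparison computes pairwise Euclidean distances between the learned vectors (the numbers only become meaningful once the model is trained on real data):
# Pairwise Euclidean distances between the learned category vectors
for i, color_a in enumerate(unique_colors):
    for j, color_b in enumerate(unique_colors):
        if i < j:
            dist = np.linalg.norm(color_embeddings[i] - color_embeddings[j])
            print(f"Distance {color_a} - {color_b}: {dist:.4f}")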
Option 2: Cyclical Representations
An interesting methodology could be to translate dummy variables into cyclical variables. This best applies to events that repeat over time; for instance, economic releases whose dates are known in advance.
To translate recurring events into cyclical representations, the expected interval between events must be known. If the interval is irregular, a possible solution is to use the median interval observed in a representative sample. This estimate can be updated over time, which becomes necessary if the frequency might change.
Starting from a vector of 0s and 1s, where each element corresponds to a timestamp (e.g. a day) and the binary value flags the presence or absence of the event, the encoding should be based on the distance between events.
After estimating the period of the series, the next step is to compute the time elapsed since the last occurrence. Again, it does not matter whether this is measured in minutes, hours or days, and the spacing may be irregular if the timestamps do not always occur at the same time.
Then, compute the “wrapped time” within the cycle:

$$t_{\text{wrapped}}(t_{idx}) = t_{\text{since}}(t_{idx}) \bmod P$$

where mod denotes the modulo operation.
Subsequently, convert this wrapped time into an angle in radians:

$$\theta(t_{idx}) = \frac{2\pi \, t_{\text{wrapped}}(t_{idx})}{P}$$

Finally, apply the chosen trigonometric function $f(\cdot)$ (which could be $\sin(\cdot)$ or $\cos(\cdot)$) to this angle to get the encoded cyclical feature:

$$x_{\text{cyc}}(t_{idx}) = f\big(\theta(t_{idx})\big)$$

Combining these steps, the complete formula for the encoded feature is:

$$x_{\text{cyc}}(t_{idx}) = f\!\left(\frac{2\pi \left(t_{\text{since}}(t_{idx}) \bmod P\right)}{P}\right)$$

where:
- $t_{\text{since}}(t_{idx})$ is the time elapsed since the last event occurrence at time index $t_{idx}$;
- $P$ is the median period between event occurrences, calculated as $P = \mathrm{median}(\Delta t_{\text{event}})$;
- $f(\cdot)$ is the chosen trigonometric function (e.g., $\sin$ or $\cos$).
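Before moving to real data, a minimal sketch of these steps on a toy daily binary event vector (values are purely illustrative) could look like this:
import numpy as np
import pandas as pd
# Toy daily binary event series: 1 on event days, 0 otherwise
events = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
# Median period P between occurrences
event_positions = np.flatnonzero(events.values)
P = np.median(np.diff(event_positions))
# Time elapsed since the last event (NaN before the first occurrence)
groups = events.cumsum()
time_since = events.groupby(groups).cumcount().mask(groups == 0)
# Wrapped time, angle in radians and encoded cyclical feature
wrapped_time = time_since % P
angle = 2 * np.pi * wrapped_time / P
encoded = np.sin(angle)
print(encoded.round(3))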
Additional adjustments
In the case of a cyclical representation, it might also be useful to make each oscillation larger or smaller according to the value of a variable observed at that time. This approach is sometimes referred to as amplitude modulation.
This works best when the variable is scaled to a meaningful range, or is restricted to a range by construction.
The modulation makes the curve more irregular, yet potentially more informative (depending on the scaling method applied), because larger or smaller oscillations carry additional signal for the model. For instance, a higher inflation print might lead to different outcomes than a lower one.
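Continuing the toy example above, a minimal sketch of the modulation, with the modulating variable illustratively rescaled to a [0.5, 1.5] band, could be:
# Illustrative amplitude modulation: rescale the driver, then multiply the cycle by it
driver = pd.Series(np.random.default_rng(1).normal(size=len(encoded)))
scaled_driver = 0.5 + (driver - driver.min()) / (driver.max() - driver.min())  # maps to [0.5, 1.5]
modulated = encoded * scaled_driver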
Example: modulated cyclical transformation of FRED CPI data
To obtain and transform the above data, the first requirement is to download the raw time series. Since this can be quite convoluted in practice, AuraStream helps facilitate the process. The full code is shown below:
import requests
import pandas as pd
import numpy as np
#Inputs
args = {
"tickerlist":["USACPIALLMINMEI"],
"start":"2021-01-01",
"end":"2025-01-01",
"interval":"1d",
"source":"fred",
"credentials":{"key":"YOUR_FRED_KEY"}
}
res = requests.post(
url = "https://api.aurastream.unbiased-alpha.com/timeseries",
json = args,
headers = {"x-api-key":"YOUR_AURASTREAM_KEY"}
)
datadict = res.json()
After obtaining the required data, it can be transformed into a pandas MultiIndex DataFrame, a convenient structure for handling the subsequent transformations.
# Necessary to build the MultiIndex DataFrame
def convert_to_multiindex_dict(data):
    # Build one DataFrame per ticker, then concatenate them column-wise into a MultiIndex DataFrame
    flattened_data = {}
    for colname, times in data.items():
        tempdf = pd.DataFrame(times).T
        tempdf.index = pd.to_datetime(tempdf.index)
        flattened_data[colname] = tempdf
    df = pd.concat(flattened_data, axis=1).sort_index()  # To ensure ordering
    return df
df = convert_to_multiindex_dict(datadict)
# We only need the Value column
df = df[("USACPIALLMINMEI", "Value")]
# Month-over-month inflation, used later to modulate the cycle's amplitude
inflation = df.pct_change()
# Convert into a cycle. First, expand the index to daily frequency
newidx = pd.date_range(df.index[0], df.index[-1])
concat = pd.concat([df, pd.Series(index=newidx.difference(df.index), dtype=float)])
concat = concat.sort_index()
Finally, the methodology of Option 2 is applied below:
#Cyclical transformation
nn = concat.notnull()
csum = nn.cumsum()
time_since = concat.groupby(csum).transform("cumcount")
time_since = time_since.mask(csum == 0)
reference_period = 20  # assumed period P between CPI releases (could also be estimated as the median interval in the index)
wrapped_time = time_since % reference_period
angle = 2 * np.pi * wrapped_time / reference_period
# Amplitude modulation: scale the cycle by the latest available inflation print
inflation_daily = inflation.reindex(concat.index).ffill()
outser = np.sin(angle) * inflation_daily
outser.plot(title="Inflation cycle")
The final outcome should look like the chart below:
Conclusions
Irregular, or sparse, time series can be misleading for Deep Learning models, especially when combined with more regular data. The transformations suggested in this article can improve the accuracy of a Deep Learning model compared to plain dummy variables in {0, 1}.