Volver a Naturgy — Predicción de precios Descargar .ipynb
— Notebook · naturgy-eda

Naturgy — EDA & Forecast Models

Exploratory analysis, feature engineering, and a head-to-head between XGBoost and a Keras NN on Spanish electricity (2015–2018).

Logo de Naturgy

Análisis Exploratorio de Datos (EDA) y Modelado Predictivo para Naturgy: Impulsando el Futuro Energético

Índice del Notebook

Este notebook detalla el desarrollo y la validación de modelos predictivos avanzados diseñados para proporcionar a Naturgy información crítica sobre tres pilares fundamentales del sistema energético: la demanda total de energía (total load actual), el precio del mercado eléctrico (price actual) y la generación de energía renovable (total renewable generation).

Hemos logrado construir modelos capaces de generar pronósticos fiables para estas variables con horizontes de 1 a 6 horas de antelación. Esta capacidad predictiva se traduce directamente en ventajas competitivas y operativas significativas para Naturgy:

  • Optimización de la Demanda y Generación: Al predecir la total load actual con varias horas de antelación, Naturgy puede optimizar el despacho de sus activos de generación, gestionar eficientemente las reservas energéticas y mejorar la estabilidad de la red, resultando en una operación más económica y segura.

  • Estrategia de Mercado y Gestión de Riesgos: Con pronósticos precisos del price actual, Naturgy puede afinar sus estrategias de participación en el mercado mayorista, optimizar la compra y venta de energía, y gestionar de manera más efectiva los riesgos asociados a la volatilidad de los precios, beneficiando tanto sus operaciones como las ofertas a sus clientes.

  • Integración Eficiente de Energías Renovables: La predicción de la total renewable generation permite a Naturgy anticipar la producción de sus fuentes renovables y de las del sistema en general. Esto es clave para gestionar la intermitencia, optimizar el uso de sistemas de respaldo y almacenamiento, y facilitar una integración más fluida y rentable de las energías limpias en el mix energético, alineándose con los objetivos de sostenibilidad.

1 Exploración Inicial y Limpieza de Datos

Librerías

In [1]
import time
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
import xgboost as xgb

from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from torch.utils.data import TensorDataset, DataLoader


warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (18, 8)

Paleta de colores

#e37404
#efaa66
#f0b983
#044474
#628baa
#9bb3c4
In [2]
naturgy_orange_palette = ['#e37404', '#efaa66', '#f0b983']
naturgy_blue_palette = ['#044474', '#628baa', '#9bb3c4']

Carga y Análisis Individual de Datasets

Dataset de Energía Mix Energético Español

In [3]
df_energy = pd.read_parquet('../data/energy_silver.parquet/')
print(f"Dataset de Energía cargado con dimensiones: {df_energy.shape}")
Dataset de Energía cargado con dimensiones: (35046, 29)
In [4]
display(df_energy.head())
display(df_energy.info())
display(df_energy.describe())
time generation biomass generation fossil brown coal or lignite generation fossil coal-derived gas generation fossil gas generation fossil hard coal generation fossil oil generation fossil oil shale generation fossil peat generation geothermal ... generation waste generation wind offshore generation wind onshore forecast solar day ahead forecast wind offshore eday ahead forecast wind onshore day ahead total load forecast total load actual price day ahead price actual
0 2015-02-26 13:00:00 472.0 0.0 0.0 3524.0 2492.0 361.0 0.0 0.0 0.0 ... 202.0 0.0 10524.0 3599.0 Unknown 10387.0 33607.0 33320.0 41.10 53.80
1 2015-04-30 07:00:00 382.0 647.0 0.0 4907.0 6307.0 363.0 0.0 0.0 0.0 ... 185.0 0.0 4478.0 2132.0 Unknown 4514.0 30485.0 30581.0 59.99 71.36
2 2015-06-13 20:00:00 486.0 980.0 0.0 4444.0 6650.0 249.0 0.0 0.0 0.0 ... 280.0 0.0 3743.0 396.0 Unknown 3981.0 27127.0 26635.0 62.13 72.12
3 2015-06-23 19:00:00 496.0 538.0 0.0 4818.0 7598.0 369.0 0.0 0.0 0.0 ... 268.0 0.0 5692.0 1082.0 Unknown 5858.0 30816.0 29731.0 62.21 75.64
4 2015-06-30 23:00:00 497.0 936.0 0.0 5253.0 6619.0 274.0 0.0 0.0 0.0 ... 237.0 0.0 1971.0 2.0 Unknown 2174.0 22428.0 23341.0 49.67 60.52

5 rows × 29 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35046 entries, 0 to 35045
Data columns (total 29 columns):
 #   Column                                       Non-Null Count  Dtype         
---  ------                                       --------------  -----         
 0   time                                         35046 non-null  datetime64[ns]
 1   generation biomass                           35046 non-null  float64       
 2   generation fossil brown coal or lignite      35046 non-null  float64       
 3   generation fossil coal-derived gas           35046 non-null  float64       
 4   generation fossil gas                        35046 non-null  float64       
 5   generation fossil hard coal                  35046 non-null  float64       
 6   generation fossil oil                        35046 non-null  float64       
 7   generation fossil oil shale                  35046 non-null  float64       
 8   generation fossil peat                       35046 non-null  float64       
 9   generation geothermal                        35046 non-null  float64       
 10  generation hydro pumped storage aggregated   35046 non-null  object        
 11  generation hydro pumped storage consumption  35046 non-null  float64       
 12  generation hydro run-of-river and poundage   35046 non-null  float64       
 13  generation hydro water reservoir             35046 non-null  float64       
 14  generation marine                            35046 non-null  float64       
 15  generation nuclear                           35046 non-null  float64       
 16  generation other                             35046 non-null  float64       
 17  generation other renewable                   35046 non-null  float64       
 18  generation solar                             35046 non-null  float64       
 19  generation waste                             35046 non-null  float64       
 20  generation wind offshore                     35046 non-null  float64       
 21  generation wind onshore                      35046 non-null  float64       
 22  forecast solar day ahead                     35046 non-null  float64       
 23  forecast wind offshore eday ahead            35046 non-null  object        
 24  forecast wind onshore day ahead              35046 non-null  float64       
 25  total load forecast                          35046 non-null  float64       
 26  total load actual                            35046 non-null  float64       
 27  price day ahead                              35046 non-null  float64       
 28  price actual                                 35046 non-null  float64       
dtypes: datetime64[ns](1), float64(26), object(2)
memory usage: 7.8+ MB
None
time generation biomass generation fossil brown coal or lignite generation fossil coal-derived gas generation fossil gas generation fossil hard coal generation fossil oil generation fossil oil shale generation fossil peat generation geothermal ... generation solar generation waste generation wind offshore generation wind onshore forecast solar day ahead forecast wind onshore day ahead total load forecast total load actual price day ahead price actual
count 35046 35046.000000 35046.000000 35046.0 35046.000000 35046.000000 35046.000000 35046.0 35046.0 35046.0 ... 35046.000000 35046.000000 35046.0 35046.000000 35046.000000 35046.000000 35046.000000 35046.000000 35046.000000 35046.000000
mean 2016-12-31 17:28:37.822290688 383.513011 448.059208 0.0 5622.737488 4256.065742 298.319808 0.0 0.0 0.0 ... 1432.665925 269.452377 0.0 5464.479769 1438.787337 5470.675056 28711.264281 28696.326628 49.873031 57.875419
min 2014-12-31 23:00:00 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.0 ... 0.000000 0.000000 0.0 0.000000 0.000000 237.000000 18105.000000 18041.000000 2.060000 9.330000
25% 2016-01-01 16:15:00 333.000000 0.000000 0.0 4126.000000 2527.000000 263.000000 0.0 0.0 0.0 ... 71.000000 240.000000 0.0 2933.000000 69.000000 2979.000000 24795.000000 24809.250000 41.490000 49.340000
50% 2016-12-31 18:30:00 367.000000 509.000000 0.0 4969.000000 4474.000000 300.000000 0.0 0.0 0.0 ... 616.000000 279.000000 0.0 4849.000000 576.000000 4853.000000 28905.000000 28896.000000 50.520000 58.010000
75% 2017-12-31 19:45:00 433.000000 757.000000 0.0 6429.000000 5838.750000 330.000000 0.0 0.0 0.0 ... 2578.000000 310.000000 0.0 7398.000000 2635.000000 7350.750000 32263.000000 32187.000000 60.530000 68.000000
max 2018-12-31 22:00:00 592.000000 999.000000 0.0 20034.000000 8359.000000 449.000000 0.0 0.0 0.0 ... 5792.000000 357.000000 0.0 17436.000000 5836.000000 17430.000000 41390.000000 41015.000000 101.990000 116.800000
std NaN 85.352783 354.568590 0.0 2201.830478 1961.601013 52.519923 0.0 0.0 0.0 ... 1680.119887 50.194840 0.0 3213.691587 1677.668887 3176.376641 4592.465001 4572.643285 14.617608 14.201153

8 rows × 27 columns

In [5]
df_energy.isnull().sum()
time                                           0
generation biomass                             0
generation fossil brown coal or lignite        0
generation fossil coal-derived gas             0
generation fossil gas                          0
generation fossil hard coal                    0
generation fossil oil                          0
generation fossil oil shale                    0
generation fossil peat                         0
generation geothermal                          0
generation hydro pumped storage aggregated     0
generation hydro pumped storage consumption    0
generation hydro run-of-river and poundage     0
generation hydro water reservoir               0
generation marine                              0
generation nuclear                             0
generation other                               0
generation other renewable                     0
generation solar                               0
generation waste                               0
generation wind offshore                       0
generation wind onshore                        0
forecast solar day ahead                       0
forecast wind offshore eday ahead              0
forecast wind onshore day ahead                0
total load forecast                            0
total load actual                              0
price day ahead                                0
price actual                                   0
dtype: int64

Observaciones: Las fuentes renovables son las que más variabilidad tiene.

Por otra parte hay fuentes que no tienen variabilidad lo que posiblemente pueden eliminarse.

Hay dos columnas de tipo object, teniendo en cuenta que es una tabla de mediciones probablemente se puedan eliminar o estén etiquetadas así por algo durante el proceso de ETL en la capa bronze.

No se aprecian valores nulos.

Dataset de Clima

In [6]
df_weather = pd.read_parquet('../data/weather_features_silver.parquet')
print(f"Dataset de Clima cargado con dimensiones: {df_weather.shape}")
Dataset de Clima cargado con dimensiones: (175271, 15)
In [7]
display(df_weather.head())
display(df_weather.info())
display(df_weather.describe())
time city_name temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h rain_3h snow_3h clouds_all weather_main weather_description
0 2015-01-16 04:00:00 Valencia 12.100 12.100 12.100 1009 61 21.6 242 0.0 0.0 0.0 76 clouds broken clouds
1 2015-01-26 20:00:00 Valencia 3.748 3.748 3.748 1027 77 3.6 46 0.0 0.0 0.0 36 clouds scattered clouds
2 2015-02-24 20:00:00 Seville 9.962 9.962 9.962 1039 62 14.4 349 0.0 0.0 0.0 0 clear sky is clear
3 2015-03-01 00:00:00 Madrid 14.800 14.800 14.800 942 89 7.2 252 0.0 0.0 0.0 8 clear sky is clear
4 2015-03-01 19:00:00 Barcelona 16.600 16.600 16.600 1015 90 7.2 225 0.0 0.0 0.0 0 clear sky is clear
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175271 entries, 0 to 175270
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   time                 175271 non-null  datetime64[ns]
 1   city_name            175271 non-null  object        
 2   temp                 175271 non-null  float64       
 3   temp_min             175271 non-null  float64       
 4   temp_max             175271 non-null  float64       
 5   pressure             175271 non-null  int32         
 6   humidity             175271 non-null  int32         
 7   wind_speed           175271 non-null  float64       
 8   wind_deg             175271 non-null  int32         
 9   rain_1h              175271 non-null  float64       
 10  rain_3h              175271 non-null  float64       
 11  snow_3h              175271 non-null  float64       
 12  clouds_all           175271 non-null  int32         
 13  weather_main         175271 non-null  object        
 14  weather_description  175271 non-null  object        
dtypes: datetime64[ns](1), float64(7), int32(4), object(3)
memory usage: 17.4+ MB
None
time temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h rain_3h snow_3h clouds_all
count 175271 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000
mean 2016-12-31 14:55:28.938614784 16.559508 15.279950 18.024507 1016.215871 68.050339 8.889137 166.727097 0.068966 0.000386 0.004848 24.343052
min 2014-12-31 23:00:00 -10.910000 -10.910000 -10.910000 918.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2016-01-01 12:00:00 10.681000 9.639750 11.777594 1013.000000 53.000000 3.600000 56.000000 0.000000 0.000000 0.000000 0.000000
50% 2016-12-31 15:00:00 16.000000 15.000000 17.000000 1018.000000 72.000000 7.200000 178.000000 0.000000 0.000000 0.000000 16.000000
75% 2017-12-31 19:00:00 22.090000 21.000000 24.000000 1022.000000 87.000000 14.400000 270.000000 0.000000 0.000000 0.000000 40.000000
max 2018-12-31 22:00:00 42.450000 42.000000 48.000000 1090.000000 100.000000 230.400000 360.000000 12.000000 2.315000 21.500000 100.000000
std NaN 8.024963 7.948565 8.613761 12.499474 21.813633 7.457877 116.543511 0.380658 0.007349 0.224579 30.338188
In [8]
df_weather.isnull().sum()
time                   0
city_name              0
temp                   0
temp_min               0
temp_max               0
pressure               0
humidity               0
wind_speed             0
wind_deg               0
rain_1h                0
rain_3h                0
snow_3h                0
clouds_all             0
weather_main           0
weather_description    0
dtype: int64

Observaciones: Podemos observar que hay datos con muy poca variabilidad sobretedo en rain y snow. Tal vez la mejor opción sea dejar solo el que sea más predictor para nuestros targets.

Por otra parte hay que comprobar las columnas temp ya que me da la sensación de que no varían tanto debido a que es por horas, entonces temp max y min pueden no ser relevantes.

3. Unión y Creación del Dataset Base

Como el dataset de energy es a nivel nacional, intentaremos hacer la media de de las condiciones del clima, que es por ciudad. Aunque perdamos algo de información, al calcular a nivel nacional no nos afectará tanto.

PROBLEMA CON LA UNIDAD DEL VIENTO

La naturaleza del viento es vectorial por lo que intar sumarlas como si fuesen escalares sería incorrecto y no representaría el Mundo real.

Dirección y velocidad del viento con componentes meridional y zonal
In [9]
numeric_cols_weather = numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns

# Conversión a componentes vectoriales del viento
df_weather['wind_x'] = df_weather['wind_speed'] * np.cos(np.radians((90 - df_weather['wind_deg']) % 360))
df_weather['wind_y'] = df_weather['wind_speed'] * np.sin(np.radians((90 - df_weather['wind_deg']) % 360))

# Agregar por tiempo (promediando las componentes vectoriales)
agg_dict = {col: 'mean' for col in numeric_cols_weather}
agg_dict['wind_x'] = 'mean'
agg_dict['wind_y'] = 'mean'

wind_x es la componente zonal del viento, que indica la velocidad del viento en dirección este-oeste. wind_y es la componente meridional del viento, que indica la velocidad del viento en dirección norte-sur.

La velocidad del viento no se eliminará ya que podría indicar patrones que no están del todo relacionados con la dirección del viento, como por ejemplo la intensidad de una tormenta.

In [10]
df_weather.describe()
time temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h rain_3h snow_3h clouds_all wind_x wind_y
count 175271 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000 175271.000000
mean 2016-12-31 14:55:28.938614784 16.559508 15.279950 18.024507 1016.215871 68.050339 8.889137 166.727097 0.068966 0.000386 0.004848 24.343052 -1.377273 0.139367
min 2014-12-31 23:00:00 -10.910000 -10.910000 -10.910000 918.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -192.508113 -63.125445
25% 2016-01-01 12:00:00 10.681000 9.639750 11.777594 1013.000000 53.000000 3.600000 56.000000 0.000000 0.000000 0.000000 0.000000 -5.515520 -3.573166
50% 2016-12-31 15:00:00 16.000000 15.000000 17.000000 1018.000000 72.000000 7.200000 178.000000 0.000000 0.000000 0.000000 16.000000 0.000000 0.000000
75% 2017-12-31 19:00:00 22.090000 21.000000 24.000000 1022.000000 87.000000 14.400000 270.000000 0.000000 0.000000 0.000000 40.000000 2.757760 3.600000
max 2018-12-31 22:00:00 42.450000 42.000000 48.000000 1090.000000 100.000000 230.400000 360.000000 12.000000 2.315000 21.500000 100.000000 58.204659 160.049289
std NaN 8.024963 7.948565 8.613761 12.499474 21.813633 7.457877 116.543511 0.380658 0.007349 0.224579 30.338188 8.675849 7.579610
In [11]
numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns
agg_dict = {col: 'mean' for col in numeric_cols_weather}
df_weather_agg = df_weather.groupby('time').agg(agg_dict).reset_index()
df_weather_agg.drop(columns=['wind_deg'], inplace=True) # Columna no necesaria tras la conversión a componentes vectoriales
df_merged = pd.merge(df_energy, df_weather_agg, on='time', how='inner')

# Columnas que tienen un único valor constante
cols_to_drop = [col for col in df_merged.columns if df_merged[col].nunique() == 1]
df_merged = df_merged.drop(columns=cols_to_drop, errors='ignore')

# Columnas de energías renovables
renewable_cols = [
    'generation biomass', 'generation hydro run-of-river and poundage',
    'generation hydro water reservoir', 'generation other renewable',
    'generation solar', 'generation waste', 'generation wind onshore'
]
df_merged['total_renewable_generation'] = df_merged[[col for col in renewable_cols if col in df_merged.columns]].sum(axis=1)

df_merged = df_merged.set_index('time')

print(f"Columnas eliminadas: {cols_to_drop}")
print(f"Dataset unido y limpio. Dimensiones: {df_merged.shape}")
Columnas eliminadas: ['generation fossil coal-derived gas', 'generation fossil oil shale', 'generation fossil peat', 'generation geothermal', 'generation hydro pumped storage aggregated', 'generation marine', 'generation wind offshore', 'forecast wind offshore eday ahead']
Dataset unido y limpio. Dimensiones: (35046, 33)

observaciones: Confirmamos lo que sospechábamos de algunas conlumnas con datos continuos.

El dataset tiene predicciones del TSO de algunas variables claves. Las eliminaré.

In [12]
tso_cols_to_drop = ['price day ahead', 'forecast wind onshore day ahead', 'forecast solar day ahead', 'total load forecast']
df_cleaned = df_merged.drop(columns=tso_cols_to_drop, errors='ignore')
In [13]
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35046 entries, 2015-02-26 13:00:00 to 2018-12-14 11:00:00
Data columns (total 29 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   generation biomass                           35046 non-null  float64
 1   generation fossil brown coal or lignite      35046 non-null  float64
 2   generation fossil gas                        35046 non-null  float64
 3   generation fossil hard coal                  35046 non-null  float64
 4   generation fossil oil                        35046 non-null  float64
 5   generation hydro pumped storage consumption  35046 non-null  float64
 6   generation hydro run-of-river and poundage   35046 non-null  float64
 7   generation hydro water reservoir             35046 non-null  float64
 8   generation nuclear                           35046 non-null  float64
 9   generation other                             35046 non-null  float64
 10  generation other renewable                   35046 non-null  float64
 11  generation solar                             35046 non-null  float64
 12  generation waste                             35046 non-null  float64
 13  generation wind onshore                      35046 non-null  float64
 14  total load actual                            35046 non-null  float64
 15  price actual                                 35046 non-null  float64
 16  temp                                         35046 non-null  float64
 17  temp_min                                     35046 non-null  float64
 18  temp_max                                     35046 non-null  float64
 19  pressure                                     35046 non-null  float64
 20  humidity                                     35046 non-null  float64
 21  wind_speed                                   35046 non-null  float64
 22  rain_1h                                      35046 non-null  float64
 23  rain_3h                                      35046 non-null  float64
 24  snow_3h                                      35046 non-null  float64
 25  clouds_all                                   35046 non-null  float64
 26  wind_x                                       35046 non-null  float64
 27  wind_y                                       35046 non-null  float64
 28  total_renewable_generation                   35046 non-null  float64
dtypes: float64(29)
memory usage: 8.0 MB

4. Análisis Exploratorio del Dataset Combinado

Distribución de Variables Numéricas (Univariante)

In [14]
def plot_distributions(df, columns, n_cols=3, figsize=(15, None), bins=30):
    # Calcular filas necesarias
    n_rows = int(np.ceil(len(columns) / n_cols))
    if figsize[1] is None:
        figsize = (figsize[0], n_rows * 4)
    
    # Crear subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    axes = np.array([axes]).flatten() if n_rows * n_cols == 1 else axes.flatten()
    
    for i, col in enumerate(columns[:len(axes)]):
        ax = axes[i]
        
        # Histograma con KDE en azul
        sns.histplot(df[col], bins=bins, kde=True, ax=ax, alpha=0.7, color=naturgy_blue_palette[0])
        
        # Estadísticas básicas
        mean_val = df[col].mean()
        median_val = df[col].median()
        
        # Líneas de media y mediana
        ax.axvline(mean_val, color=naturgy_orange_palette[0], linewidth=2, label=f'Media: {mean_val:.2f}')
        ax.axvline(median_val, color=naturgy_orange_palette[1], linewidth=2, linestyle='--', 
                  label=f'Mediana: {median_val:.2f}')
        
        # Títulos y etiquetas
        ax.set_title(f'{col}')
        ax.legend(fontsize=8)
        ax.grid(True, alpha=0.3)
    
    # Ocultar ejes no usados
    for j in range(len(columns), len(axes)):
        axes[j].set_visible(False)
    
    plt.tight_layout()
    plt.show()
In [15]
num_cols = df_cleaned.select_dtypes(include=np.number).columns
plot_distributions(df_cleaned, num_cols)

Observaciones: hay bastante variedad en las distribuciones:

  • Distribuciones con cola a la derecha: una gran parte de las variables muestran este patrón. Podría indicar valores extremos que tendríamos que tratar.
  • Distribuciones normales: algunas variables presentan patrones más estables como la temperatura o el precio.
  • Distribuciones bimodales/multimodales: podrían indicar patrones en el sistema energético.
  • Distribuciones casi en 0: podrían tener muy poca relevancia para predicciones.

Análisis de las Variables Objetivo (Targets)

In [16]
targets_1 = ['total load actual', 'price actual', 'total_renewable_generation']
plot_distributions(df_cleaned, targets_1)

Observaciones

  • total load actual: se aprecia claramente una distribución bimodal, podría ser por las diferenctes estaciones. Podría ser interesante hacer una variable categórica que capture los patrones de consumo.
  • price actual: tiene una distribución asimétrica positiva con una ligera bimodalidad, podría estar relacionado con la distribución de la generación total. De todas formas tal vez se podría aplicar algún tipo de transformación, eliminar los valores extremos .
  • total_renewable_generation: distribución asimétrica pero casi hace una campana de Gauss.

Análisis de Outliers

In [17]
num_cols = df_cleaned.select_dtypes(include=np.number).columns

n_cols = 4
n_rows = int(np.ceil(len(num_cols) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, n_rows * 3))
axes = axes.flatten()

box_colors = {
    'boxes': naturgy_blue_palette[0],
    'whiskers': naturgy_blue_palette[1],
    'medians': naturgy_orange_palette[0],
    'caps': naturgy_blue_palette[2],
    'fliers': naturgy_orange_palette[1]
}

for i, col in enumerate(num_cols):
    if i < len(axes):
        boxplot = axes[i].boxplot(df_cleaned[col], patch_artist=True)
        
        for element, color in box_colors.items():
            plt.setp(boxplot[element], color=color)
            
        for patch in boxplot['boxes']:
            patch.set(facecolor=naturgy_blue_palette[2], alpha=0.7)
            
        axes[i].set_title(col, fontsize=10, color=naturgy_blue_palette[0], fontweight='bold')
        axes[i].tick_params(axis='both', labelsize=8, colors=naturgy_blue_palette[1])
        axes[i].set_xlabel('', fontsize=8)
        axes[i].set_ylabel('Value', fontsize=8, color=naturgy_blue_palette[1])
        
        axes[i].grid(True, linestyle='--', alpha=0.3)
        axes[i].set_facecolor('#f8f9fa')

for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Detección de Outliers con Boxplots', 
             fontsize=16, 
             color=naturgy_blue_palette[0],
             fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()

Observaciones: A pesar de que se aprecian numerosos outlayers estos pueden guardan información bastante relevante, como fenómenos atmosféricos extremos. Podría ser interesante tratar solo aqullos que puedan sesgar el modelo.

Análisis de Correlaciones

In [18]
def plot_correlation_matrix(df, figsize=(24, 20), cmap='coolwarm', annot=True, fontsize_annot=8, title = "Matriz de Correlación"):
    
    fig, ax = plt.subplots(figsize=figsize)
    
    sns.heatmap(
        df, 
        annot=annot, 
        cmap=cmap, 
        center=0, 
        fmt='.2f', 
        annot_kws={"size": fontsize_annot}, 
        cbar_kws={"shrink": 0.8}
    )

    plt.title(title, fontsize=16, color='#044474', fontweight='bold')
    plt.xticks(rotation=90, fontsize=8)
    plt.yticks(rotation=0, fontsize=8)
    plt.tight_layout()
    
    return fig, ax
In [19]
plot_correlation_matrix(df_cleaned.select_dtypes(include=np.number).corr(), title="Matriz de Correlación de Variables Numéricas")
(<Figure size 2400x2000 with 2 Axes>,
 <Axes: title={'center': 'Matriz de Correlación de Variables Numéricas'}>)

Observaciones : Es una tabla con bastantes dimensiones, no obstante probablemente se eliminen variables que no apartan nada y puede añadir ruido.

  • Correlación entre fuentes fósiles: nos indica que suben y bajan de forma similar.
  • Las condiciones meteorológicos parecen factores determinantes para calcular la genereación de las renovables.
  • Las temperaturas entre ellas están demasiado correlacionadas (colinealidad).
  • La generación nuclear no parece ser un predictor interesante.

Correlación entre Targets

Analizamos específicamente la relación entre nuestras variables objetivo. Esto es clave para un modelo multi-target, ya que una alta correlación puede indicar que los modelos pueden aprender patrones compartidos.

In [20]

plot_correlation_matrix(df_cleaned[targets_1].corr(), figsize=(10, 8), title="Matriz de Correlación entre Variables Target")
(<Figure size 1000x800 with 2 Axes>,
 <Axes: title={'center': 'Matriz de Correlación entre Variables Target'}>)

Series Temporales y Estacionalidad

Creo que sería interesante añadir columnas basadas en la fecha y hora. Por la naturaleza cíclica de los datos. De esta forma el modelo podría beneficiarse de patrones que pueda proporcionar las fechas.

5 Limpieza, Procesamiento y Feature Engineering

Tratamiento de outlayers (Extremos)

In [21]
columns_to_cap = [
    'generation fossil gas', 'generation fossil oil',
    'generation hydro pumped storage consumption',
    'generation biomass', 'pressure', 'wind_speed', 'wind_x', 'wind_y',
]

df_processed = df_cleaned.copy()

for col in columns_to_cap:
    Q1 = df_processed[col].quantile(0.25)
    Q3 = df_processed[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df_processed[col] = np.clip(df_processed[col], lower, upper)

Tranformaciones

In [22]
df_transformed = df_processed.copy()

cols_to_log1p = [
    'generation hydro pumped storage consumption',
    'rain_1h',
]

cols_to_log = [
    'generation fossil gas',
    'generation hydro water reservoir',
]

for col in cols_to_log1p:
    if col in df_transformed.columns:
        df_transformed[col] = np.log1p(df_transformed[col])
        df_transformed.rename(columns={col: f'{col}_log'}, inplace=True)
        print(f"Columna '{col}' transformada con log1p.")

epsilon = 1e-9
for col in cols_to_log:
    if col in df_transformed.columns:
        df_transformed[col] = np.log(df_transformed[col] + epsilon)
        df_transformed.rename(columns={col: f'{col}_log'}, inplace=True)
        print(f"Columna '{col}' transformada con log.")


plot_distributions(df_transformed, [f'{c}_log' for c in cols_to_log1p + cols_to_log])
Columna 'generation hydro pumped storage consumption' transformada con log1p.
Columna 'rain_1h' transformada con log1p.
Columna 'generation fossil gas' transformada con log.
Columna 'generation hydro water reservoir' transformada con log.

Las gráficas ya muestran una distribución más simétricas en el sentido que media = mediana (aprox)

Con clauds_all provoca un sesgo hacia la izquierda bastante grande y no mejora la distribución, no aplicaré log.

Análisis del Factor de Inflación de la Varianza (VIF)

El VIF es una medida numérica para la multicolinealidad. Un valor de VIF > 10 es una señal fuerte de que una variable es redundante y podría ser eliminada. Esto ayuda a crear modelos más simples y estables.

In [23]
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(df_numeric):
    vif_data = pd.DataFrame()
    vif_data['feature'] = df_numeric.columns
    vif_data['VIF'] = [variance_inflation_factor(df_numeric.values, i) for i in range(len(df_numeric.columns))]
    return vif_data.sort_values(by='VIF', ascending=False)

numeric_df = df_transformed.select_dtypes(include=np.number).dropna()
vif_results = calculate_vif(numeric_df)

print("Resultados del VIF (Factor de Inflación de la Varianza):")
display(vif_results.head(20))
Resultados del VIF (Factor de Inflación de la Varianza):
feature VIF
16 temp 7714.952660
19 pressure 2887.828290
2 generation fossil gas_log 2784.263753
18 temp_max 2056.436473
17 temp_min 1988.810284
7 generation hydro water reservoir_log 413.393396
14 total load actual 394.255558
28 total_renewable_generation 307.434760
10 generation other renewable 111.016859
20 humidity 76.294145
8 generation nuclear 72.803474
13 generation wind onshore 69.761796
12 generation waste 66.919433
4 generation fossil oil 65.248394
0 generation biomass 64.965086
15 price actual 31.559148
3 generation fossil hard coal 28.369906
6 generation hydro run-of-river and poundage 28.091222
9 generation other 19.192147
11 generation solar 10.053490

Para producción se podría plantear alguna lógica a la hora de elimnar estas columnas

In [24]
# Se incluen algunas columnas con vif > 10, las de generación de energías renovables ya que tienen una correlación demasiado alta ya con la total.
cols_to_drop_vif = [
    'generation biomass', 'generation hydro run-of-river and poundage',
    'generation hydro water reservoir', 'generation other renewable',
    'generation solar', 'generation waste', 'generation wind onshore',
    'temp_min', 'temp_max',
]

df_vif_cleaned = df_transformed.drop(columns=cols_to_drop_vif, errors='ignore')
print(f"Dimensiones tras eliminar variables con alto VIF: {df_vif_cleaned.shape}")
df_vif_cleaned.info()
Dimensiones tras eliminar variables con alto VIF: (35046, 21)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35046 entries, 2015-02-26 13:00:00 to 2018-12-14 11:00:00
Data columns (total 21 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   generation fossil brown coal or lignite          35046 non-null  float64
 1   generation fossil gas_log                        35046 non-null  float64
 2   generation fossil hard coal                      35046 non-null  float64
 3   generation fossil oil                            35046 non-null  float64
 4   generation hydro pumped storage consumption_log  35046 non-null  float64
 5   generation hydro water reservoir_log             35046 non-null  float64
 6   generation nuclear                               35046 non-null  float64
 7   generation other                                 35046 non-null  float64
 8   total load actual                                35046 non-null  float64
 9   price actual                                     35046 non-null  float64
 10  temp                                             35046 non-null  float64
 11  pressure                                         35046 non-null  float64
 12  humidity                                         35046 non-null  float64
 13  wind_speed                                       35046 non-null  float64
 14  rain_1h_log                                      35046 non-null  float64
 15  rain_3h                                          35046 non-null  float64
 16  snow_3h                                          35046 non-null  float64
 17  clouds_all                                       35046 non-null  float64
 18  wind_x                                           35046 non-null  float64
 19  wind_y                                           35046 non-null  float64
 20  total_renewable_generation                       35046 non-null  float64
dtypes: float64(21)
memory usage: 5.9 MB

6 Featuring Engineering avanzado

Creamos características temporales y los tranformamos en componentes cíclicas de seno y coseno para capturar la naturaleza cíclica de los datos temporales.

In [25]
df_vif_cleaned['hour'] = df_vif_cleaned.index.hour
df_vif_cleaned['dayofweek'] = df_vif_cleaned.index.dayofweek
df_vif_cleaned['month'] = df_vif_cleaned.index.month
df_vif_cleaned['year'] = df_vif_cleaned.index.year
df_vif_cleaned['dayofyear'] = df_vif_cleaned.index.dayofyear

# Features cíclicas con sen y cos
# Con esto resolvemos el problema del reloj, el reloj no es una linea recta sino que es un círculo.
df_vif_cleaned['hour_sin'] = np.sin(2 * np.pi * df_vif_cleaned['hour']/24.0)
df_vif_cleaned['hour_cos'] = np.cos(2 * np.pi * df_vif_cleaned['hour']/24.0)

# Para el día de la semana (ciclo de 7 días)
df_vif_cleaned['dayofweek_sin'] = np.sin(2 * np.pi * df_vif_cleaned['dayofweek']/7.0)
df_vif_cleaned['dayofweek_cos'] = np.cos(2 * np.pi * df_vif_cleaned['dayofweek']/7.0)

# Para el mes del año (ciclo de 12 meses)
df_vif_cleaned['month_sin'] = np.sin(2 * np.pi * df_vif_cleaned['month']/12.0)
df_vif_cleaned['month_cos'] = np.cos(2 * np.pi * df_vif_cleaned['month']/12.0)

# Para el día del año (ciclo de 365 días)
df_vif_cleaned['dayofyear_sin'] = np.sin(2 * np.pi * df_vif_cleaned['dayofyear']/365.0)
df_vif_cleaned['dayofyear_cos'] = np.cos(2 * np.pi * df_vif_cleaned['dayofyear']/365.0)

# Eliminar las columnas temporales lineales
cols_to_drop_cyclical = ['hour', 'dayofweek', 'month', 'year', 'dayofyear']
df_vif_cleaned = df_vif_cleaned.drop(columns=cols_to_drop_cyclical, errors='ignore')
print(f"Dimensiones tras añadir variables cíclicas: {df_vif_cleaned.shape}")
Dimensiones tras añadir variables cíclicas: (35046, 29)

Para modelos de predicciones en el futuro, lo ideal es que el conjunto de datos para los modelos aporten preguntas tan simples como: ¿Qué fue lo último que paso hace una?, ¿Cómo era la tendencia hace 24h?, etc. Siguiendo esta lógico podemos aprovecharnos de dos estrategias fundamentales, de esta forma capturaremos información bastante interesante de estas series temporales.

Ordenación de los dato en orden cronológico

Para Poder validar los resultados.

In [26]
print(df_vif_cleaned.index[:5])
DatetimeIndex(['2015-02-26 13:00:00', '2015-04-30 07:00:00',
               '2015-06-13 20:00:00', '2015-06-23 19:00:00',
               '2015-06-30 23:00:00'],
              dtype='datetime64[ns]', name='time', freq=None)
In [27]
df_vif_cleaned = df_vif_cleaned.sort_index()
print(df_vif_cleaned.index[:5])
DatetimeIndex(['2014-12-31 23:00:00', '2015-01-01 00:00:00',
               '2015-01-01 01:00:00', '2015-01-01 02:00:00',
               '2015-01-01 03:00:00'],
              dtype='datetime64[ns]', name='time', freq=None)

Lags

Es el valor de la misma variable en un período anterior (t) Me recuerda un poco al lag de los videojuegos.

In [28]
dimensions_before = df_vif_cleaned.shape
def add_lags(df, col, lags, prefix=None):
    if prefix is None:
        prefix = col
    for lag in lags:
        df[f'{prefix}_lag_{lag}h'] = df[col].shift(lag)
    return df

# Lags para nuestros targets
lags_targets = [1, 3, 6, 12, 24, 48, 168, 336]
# Los lags de la generación de renovables ya se generan en las de la generacion de energía
df_vif_cleaned = add_lags(df_vif_cleaned, 'total load actual', lags_targets, prefix='load')
df_vif_cleaned = add_lags(df_vif_cleaned, 'price actual', lags_targets, prefix='price')

# Para las variables de clima
lags_weather = [1, 3, 6, 12, 24, 48, 168, 336]
for col in df_vif_cleaned.columns:
    if 'temp' in col or 'rain' in col or 'wind' in col or 'pressure' in col or 'clouds' in col:
        df_vif_cleaned = add_lags(df_vif_cleaned, col, lags_weather, prefix=col)

# Para la generación de energía
lags_generation = [1, 3, 6, 12, 24, 48, 168, 336]
for col in df_vif_cleaned.columns:
    if 'generation' in col:
        df_vif_cleaned = add_lags(df_vif_cleaned, col, lags_generation, prefix=col)

7 Pipeline de Procesamiento de datos

In [29]
def preprocess_data_pipeline(energy_parquet_path: str, 
                             weather_parquet_path: str, 
                             is_training: bool = True) -> pd.DataFrame:
    """
    Función completa para preprocesar los datos de energía y clima

    Args:
        energy_parquet_path (str): Ruta al archivo parquet de datos de energía.
        weather_parquet_path (str): Ruta al archivo parquet de datos de clima.
        is_training (bool): True si los datos se procesan para entrenamiento 
                              (se crearán targets futuros). 
                              False para predicción (no se crean targets futuros).

    Returns:
        pd.DataFrame: DataFrame procesado y listo para el siguiente paso 
                      (separación X/y o predicción).
    """
    print("Iniciando pipeline de preprocesamiento de datos...")

    # --- 1. Carga de Datasets ---
    print(f"Cargando datos de energía desde: {energy_parquet_path}")
    df_energy = pd.read_parquet(energy_parquet_path)
    print(f"Cargando datos de clima desde: {weather_parquet_path}")
    df_weather = pd.read_parquet(weather_parquet_path)

    # --- 2. Preprocesamiento de Datos de Clima y Unión ---
    print("Preprocesando datos de clima y uniendo datasets...")
    if 'wind_speed' in df_weather.columns and 'wind_deg' in df_weather.columns:
        df_weather['wind_x'] = df_weather['wind_speed'] * np.cos(np.radians((90 - df_weather['wind_deg']) % 360))
        df_weather['wind_y'] = df_weather['wind_speed'] * np.sin(np.radians((90 - df_weather['wind_deg']) % 360))
    else:
        print("Advertencia: 'wind_speed' o 'wind_deg' no encontradas en datos de clima. Omitiendo wind_x/wind_y.")

    numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns.tolist()
    agg_dict = {col: 'mean' for col in numeric_cols_weather if col in df_weather.columns}
    if 'wind_x' in agg_dict: agg_dict['wind_x'] = 'mean' # Asegurar que se usa si existe
    if 'wind_y' in agg_dict: agg_dict['wind_y'] = 'mean'

    if 'time' not in df_weather.columns:
        raise ValueError("Columna 'time' no encontrada en df_weather, necesaria para la agregación.")
    df_weather_agg = df_weather.groupby('time').agg(agg_dict).reset_index()

    df_merged = pd.merge(df_energy, df_weather_agg, on='time', how='inner')

    cols_to_drop_constant = [col for col in df_merged.columns if df_merged[col].nunique(dropna=False) == 1]
    if cols_to_drop_constant:
        print(f"Eliminando columnas constantes: {cols_to_drop_constant}")
        df_merged = df_merged.drop(columns=cols_to_drop_constant, errors='ignore')

    renewable_cols_list = [
        'generation biomass', 'generation hydro run-of-river and poundage',
        'generation hydro water reservoir', 'generation other renewable',
        'generation solar', 'generation waste', 'generation wind onshore'
    ]
    existing_renewable_cols = [col for col in renewable_cols_list if col in df_merged.columns]
    if existing_renewable_cols:
        df_merged['total_renewable_generation'] = df_merged[existing_renewable_cols].sum(axis=1)
    else:
        print("Advertencia: Columnas de generación renovable no encontradas. 'total_renewable_generation' será 0.")
        df_merged['total_renewable_generation'] = 0
    
    if 'time' not in df_merged.columns:
        raise ValueError("Columna 'time' no encontrada en df_merged, necesaria para el índice.")
    df_merged = df_merged.set_index('time')
    df_merged.index = pd.to_datetime(df_merged.index)

    tso_cols_to_drop_list = ['price day ahead', 'forecast wind onshore day ahead', 'forecast solar day ahead', 'total load forecast']
    existing_tso_cols = [col for col in tso_cols_to_drop_list if col in df_merged.columns]
    if existing_tso_cols:
        print(f"Eliminando columnas TSO: {existing_tso_cols}")
        df_processed = df_merged.drop(columns=existing_tso_cols, errors='ignore')
    else:
        df_processed = df_merged.copy()

    # --- 3. Limpieza, Procesamiento y Feature Engineering Adicional ---
    print("Aplicando tratamiento de outliers y transformaciones...")
    columns_to_cap_list = [
        'generation fossil gas', 'generation fossil oil', 
        'generation biomass', 'pressure', 'wind_speed', 'wind_x', 'wind_y',
    ]
    for col in columns_to_cap_list:
        if col in df_processed.columns:
            Q1 = df_processed[col].quantile(0.25); Q3 = df_processed[col].quantile(0.75)
            IQR = Q3 - Q1
            df_processed[col] = np.clip(df_processed[col], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
        else: print(f"Advertencia: Columna para capping '{col}' no encontrada.")

    cols_to_log1p_list = ['generation fossil oil', 'generation hydro pumped storage consumption']
    cols_to_log_list = ['generation fossil gas', 'generation hydro water reservoir', 'wind_speed']
    epsilon = 1e-9

    for col in cols_to_log1p_list:
        if col in df_processed.columns:
            df_processed[f'{col}_log1p'] = np.log1p(df_processed[col])
            df_processed.drop(columns=[col], inplace=True, errors='ignore')
        else: print(f"Advertencia: Columna para log1p '{col}' no encontrada.")
            
    for col in cols_to_log_list:
        if col in df_processed.columns:
            df_processed[f'{col}_log'] = np.log(df_processed[col] + epsilon)
            df_processed.drop(columns=[col], inplace=True, errors='ignore') 
        else: print(f"Advertencia: Columna para log '{col}' no encontrada.")

    print("Creando características temporales...")
    df_processed['hour_val'] = df_processed.index.hour
    df_processed['dayofweek_val'] = df_processed.index.dayofweek
    df_processed['month_val'] = df_processed.index.month
    df_processed['year_val'] = df_processed.index.year
    df_processed['dayofyear_val'] = df_processed.index.dayofyear

    df_processed['hour_sin'] = np.sin(2 * np.pi * df_processed['hour_val'] / 24.0)
    df_processed['hour_cos'] = np.cos(2 * np.pi * df_processed['hour_val'] / 24.0)
    df_processed['dayofweek_sin'] = np.sin(2 * np.pi * df_processed['dayofweek_val'] / 7.0)
    df_processed['dayofweek_cos'] = np.cos(2 * np.pi * df_processed['dayofweek_val'] / 7.0)
    df_processed['month_sin'] = np.sin(2 * np.pi * df_processed['month_val'] / 12.0)
    df_processed['month_cos'] = np.cos(2 * np.pi * df_processed['month_val'] / 12.0)
    df_processed['dayofyear_sin'] = np.sin(2 * np.pi * df_processed['dayofyear_val'] / 365.0)
    df_processed['dayofyear_cos'] = np.cos(2 * np.pi * df_processed['dayofyear_val'] / 365.0)

    cols_to_drop_vif_list = [ 
        'generation biomass', 'generation hydro run-of-river and poundage',
        'generation hydro water reservoir',
        'generation hydro water reservoir_log',
        'generation other renewable', 'generation solar', 'generation waste', 
        'generation wind onshore', 'temp_min', 'temp_max',
        'year_val', 'month_val', 'dayofyear_val', 'hour_val', 'dayofweek_val'
    ]
    
    actual_cols_to_drop_vif = [col for col in cols_to_drop_vif_list if col in df_processed.columns]
    if actual_cols_to_drop_vif:
        print(f"Eliminando columnas por VIF (predefinidas): {actual_cols_to_drop_vif}")
        df_vif_cleaned = df_processed.drop(columns=actual_cols_to_drop_vif, errors='ignore')
    else:
        df_vif_cleaned = df_processed.copy()

    # --- 4. Feature Engineering Avanzado (Lags) ---
    print("Aplicando feature engineering avanzado (lags)...")
    df_vif_cleaned = df_vif_cleaned.sort_index()

    lag_periods = [1, 3, 6, 12, 24, 48, 168, 336]
    
    cols_for_target_lags = {
        'load': 'total load actual', 
        'price': 'price actual',
        'renewable_gen': 'total_renewable_generation'
    }
    for prefix, source_col in cols_for_target_lags.items():
        if source_col in df_vif_cleaned.columns:
            for lag in lag_periods:
                df_vif_cleaned[f'{prefix}_lag_{lag}h'] = df_vif_cleaned[source_col].shift(lag)
        else: print(f"Advertencia: Columna fuente para lags de target '{source_col}' no encontrada.")

    cols_to_exclude_from_other_lags = list(cols_for_target_lags.values()) + \
                                      [col for col in df_vif_cleaned.columns if '_lag_' in col or '_sin' in col or '_cos' in col]
    
    other_cols_for_lags = [col for col in df_vif_cleaned.columns if col not in cols_to_exclude_from_other_lags and df_vif_cleaned[col].dtype in [np.float64, np.int64]]

    for col_name in other_cols_for_lags:
        safe_prefix = col_name.replace(' ', '_').replace('-', '_').lower()
        for lag in lag_periods:
            df_vif_cleaned[f'{safe_prefix}_lag_{lag}h'] = df_vif_cleaned[col_name].shift(lag)
            
    rolling_lag_windows_list = [1, 3, 6, 12, 24, 48]
    rolling_base_cols_map = {
        'load_mean': 'total load actual',
        'price_mean': 'price actual',
        'renewable_mean': 'total_renewable_generation'
    }
    for prefix_key, original_col in rolling_base_cols_map.items():
        if original_col in df_vif_cleaned.columns:
            for lag_window in rolling_lag_windows_list:
                df_vif_cleaned[f'{prefix_key}_{lag_window}h_lag_{lag_window}h'] = df_vif_cleaned[original_col].shift(lag_window)
        else: print(f"Advertencia: Columna base para 'rolling lags' '{original_col}' no encontrada.")
    
    print("Eliminando NaNs introducidos por lags...")
    df_vif_cleaned = df_vif_cleaned.dropna()

    # --- 5. Creación de Variables Objetivo Futuras ---
    final_df = df_vif_cleaned.copy()
    if is_training:
        print("Creando variables objetivo futuras...")
        targets_to_create_map = {
            'renewable_H_plus_6': ('total_renewable_generation', -6),
            'load_H_plus_6': ('total load actual', -6),
            'price_H_plus_6': ('price actual', -6),
            'renewable_H_plus_1': ('total_renewable_generation', -1),
            'load_H_plus_1': ('total load actual', -1),
            'price_H_plus_1': ('price actual', -1)
        }
        for new_target_col, (source_col, shift_val) in targets_to_create_map.items():
            if source_col in final_df.columns: # Verificar si la columna fuente existe
                final_df[new_target_col] = final_df[source_col].shift(shift_val)
            else:
                print(f"Advertencia: Columna fuente '{source_col}' para crear target '{new_target_col}' no encontrada. Target no se creará.")
        
        print("Eliminando NaNs introducidos por el shift de targets...")
        final_df = final_df.dropna()
    else:
        print("Modo predicción: No se crearán variables objetivo futuras.")

    print(f"Pipeline de preprocesamiento completado. Dimensiones finales del DataFrame: {final_df.shape}")
    return final_df

8 Preparación y Entrenamiento de modelos

In [30]
# Ruta para guardar los modelos
models_path = '../Modelo_Entrenado/'

En este apartado experimentaré principalment con dos modelos:

  • XGBoost: me pareció una opción interesante por su eficiencia en el entrenamiento, probé con random forest y tardaba más con resultados similares. Su opción de MultiOutputRegressor simplifica mucho el entrenamiento ideal para nuetro caso que tenemos varios targets. Además con una configuración sencilla de buenos resultados.

  • Red Neuronal: interesante para nuestras variables, especialmente para predecir el precio, que en nustro contexto es una de las variables más difíciles de predecir debido a su volatilidad, no tienen un patrón cíclico tan marcado como la generación o demanda.

Crear las variables objetivo futuras

In [31]
df_vif_cleaned['renewable_H_plus_6'] = df_vif_cleaned['total_renewable_generation'].shift(-6)
df_vif_cleaned['load_H_plus_6'] = df_vif_cleaned['total load actual'].shift(-6)

df_vif_cleaned['renewable_H_plus_1'] = df_vif_cleaned['total_renewable_generation'].shift(-1)
df_vif_cleaned['load_H_plus_1'] = df_vif_cleaned['total load actual'].shift(-1)
df_vif_cleaned['price_H_plus_1'] = df_vif_cleaned['price actual'].shift(-1)
df_vif_cleaned = df_vif_cleaned.dropna() 

Division de datos entrenamiento validacion y prueba final

Dejaremos un mes de datos totalmente aislados para la prueba final de los modelos.

In [32]
cutoff_date = pd.to_datetime('2018-12-10')
df_dev = df_vif_cleaned[df_vif_cleaned.index < cutoff_date]
In [33]
targets = ['renewable_H_plus_6', 'load_H_plus_6',
           'renewable_H_plus_1', 'load_H_plus_1', 'price_H_plus_1']

#Datos para la prueba final.
df_final_test = df_vif_cleaned[df_vif_cleaned.index >= cutoff_date] 
X_final_test = df_final_test.drop(columns=targets, errors='ignore')
y_final_test = df_final_test[targets]

# Datos para el entrenamiento
X_dev_features = df_dev.drop(columns=targets, errors='ignore')
y_dev_targets = df_dev[targets]

# Split de datos para el entrenamiento
X_train, X_val, y_train, y_val = train_test_split(
    X_dev_features, y_dev_targets, test_size=0.2, random_state=42, shuffle=False
)

Preprocesamiento y Escalado

In [34]
numeric_features = X_train.select_dtypes(include=np.number).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features)
    ]
)

X_train_scaled = preprocessor.fit_transform(X_train)

X_val_scaled = preprocessor.transform(X_val)

XGBoost

Configuración mínima. Tiene por defecto tiene regularización L1 y L2 Con esta configuración da muy buenos resultados sin afectar a la velocidad de entrenamiento. Hice alguna prueba con grid search pero los resultados no mejoran, entonces preferí ponerlos a mano.

Entrenamiento y Guardado de XGBoost

In [35]
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror', random_state=42, n_estimators=1700, 
                                                        max_depth=8, subsample=0.8,
                                                        learning_rate=0.005, tree_method='gpu_hist', gpu_id=0
                                                        )))
])

print("Entrenando el modelo XGBoost...")
start_time_xgb = time.time()
xgb_pipeline.fit(X_train, y_train)
end_time_xgb = time.time()
training_time_xgb = end_time_xgb - start_time_xgb

print("Entrenamiento completado.")
print(f"Tiempo de entrenamiento de XGBoost: {training_time_xgb:.2f} segundos")
Entrenando el modelo XGBoost...
Entrenamiento completado.
Tiempo de entrenamiento de XGBoost: 177.59 segundos
In [36]
model_filename_xgb = models_path + '/xgb_model-energy_weather.joblib'
joblib.dump(xgb_pipeline, model_filename_xgb)
print(f"Modelo XGBoost guardado en: {model_filename_xgb}")
Modelo XGBoost guardado en: ../Modelo_Entrenado//xgb_model-energy_weather.joblib

Red Neuronal

La red neural que he diseñado es bastante sencilla

  • Estructura: 1 capa de entrada, 2 ocultas (256, 128), y 1 capa de salida.
  • Hiperparámetros
    • BatchNorm: después de cada capa densa para estabilizar y acelerar el entrenamiento.
    • Dropout(0.2): para prevenir sobreajuste. Está configurado al 20% para regularizar fuertemente el modelo.
  • Optimizador: Adam. Es el que mejor funciona generalmente.
  • Función de activación: ReLU en las capas ocultas, lineal en la capa de salida para regresión.
  • Función de pérdida: (L1Loss) pérdida del error medio cuadrático.
  • Learning Rate: Comenzamos con un valor de 0.00005 y utilizamos un scheduler para reducir la tasa de aprendizaje cuando la pérdida de validación se estanca.
  • Early Stopping: Detenemos el entrenamiento si la pérdida de validación no mejora durante 40 épocas.

Preparación de Datos para Red Neuronal

In [37]
X_train_nn = X_train_scaled
X_val_nn = X_val_scaled

y_scaler = StandardScaler()
y_train_nn_scaled = y_scaler.fit_transform(y_train)
y_val_nn_scaled = y_scaler.transform(y_val)

train_loader = DataLoader(
    TensorDataset(torch.tensor(X_train_nn, dtype=torch.float32), torch.tensor(y_train_nn_scaled, dtype=torch.float32)),
    batch_size=64, shuffle=True
)

val_loader = DataLoader(
    TensorDataset(torch.tensor(X_val_nn, dtype=torch.float32), torch.tensor(y_val_nn_scaled, dtype=torch.float32)),
    batch_size=64, shuffle=False
)

Entrenamiento y Guardado de Red Neuronal

In [38]
# Modelo
class MLP(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128), 
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, output_size)
        )
    def forward(self, x): return self.layers(x)

# Parámetros del modelo
epochs = 250
patience = 40
min_delta = 0.001
best_val_loss = float('inf')
patience_counter = 0

model = MLP(X_train_nn.shape[1], y_train.shape[1])
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.00001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=20)

train_losses, test_losses = [], []

print("Entrenando el modelo de Red Neuronal (PyTorch) con Early Stopping...")
start_time_nn = time.time()

for epoch in range(epochs):
    # Entrenamiento
    model.train()
    train_loss = 0
    for Xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(Xb), yb)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_losses.append(train_loss / len(train_loader))
    
    # Validación
    model.eval()
    with torch.no_grad():
        test_loss = sum(criterion(model(Xb), yb).item() for Xb, yb in val_loader) / len(val_loader)
    test_losses.append(test_loss)
    
    # Learning rate scheduler
    scheduler.step(test_loss)
    
    # Early stopping
    if test_loss < best_val_loss - min_delta:
        best_val_loss = test_loss
        patience_counter = 0
        # Guardar el mejor modelo
        torch.save(model.state_dict(), 'best_nn_model.pth')
    else:
        patience_counter += 1
    
    print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}, LR: {optimizer.param_groups[0]["lr"]:.6f}')
    
    if patience_counter >= patience:
        print(f'Early stopping en época {epoch+1}')
        break

# Cargar el mejor modelo
model.load_state_dict(torch.load('best_nn_model.pth'))

end_time_nn = time.time()
training_time_nn = end_time_nn - start_time_nn

print("Entrenamiento de la Red Neuronal completado.")
print(f"Tiempo de entrenamiento de la Red Neuronal: {training_time_nn:.2f} segundos")
print(f"Mejor pérdida de validación: {best_val_loss:.4f}")
Entrenando el modelo de Red Neuronal (PyTorch) con Early Stopping...
Epoch [1/250], Train Loss: 0.7543, Test Loss: 0.6501, LR: 0.000010
Epoch [2/250], Train Loss: 0.6144, Test Loss: 0.5692, LR: 0.000010
Epoch [3/250], Train Loss: 0.5488, Test Loss: 0.5051, LR: 0.000010
Epoch [4/250], Train Loss: 0.5083, Test Loss: 0.4794, LR: 0.000010
Epoch [5/250], Train Loss: 0.4792, Test Loss: 0.4436, LR: 0.000010
Epoch [6/250], Train Loss: 0.4587, Test Loss: 0.4114, LR: 0.000010
Epoch [7/250], Train Loss: 0.4418, Test Loss: 0.4014, LR: 0.000010
Epoch [8/250], Train Loss: 0.4291, Test Loss: 0.4041, LR: 0.000010
Epoch [9/250], Train Loss: 0.4189, Test Loss: 0.3831, LR: 0.000010
Epoch [10/250], Train Loss: 0.4123, Test Loss: 0.3628, LR: 0.000010
Epoch [11/250], Train Loss: 0.4032, Test Loss: 0.3633, LR: 0.000010
Epoch [12/250], Train Loss: 0.3956, Test Loss: 0.3519, LR: 0.000010
Epoch [13/250], Train Loss: 0.3904, Test Loss: 0.3480, LR: 0.000010
Epoch [14/250], Train Loss: 0.3869, Test Loss: 0.3416, LR: 0.000010
Epoch [15/250], Train Loss: 0.3815, Test Loss: 0.3338, LR: 0.000010
Epoch [16/250], Train Loss: 0.3770, Test Loss: 0.3372, LR: 0.000010
Epoch [17/250], Train Loss: 0.3710, Test Loss: 0.3271, LR: 0.000010
Epoch [18/250], Train Loss: 0.3713, Test Loss: 0.3240, LR: 0.000010
Epoch [19/250], Train Loss: 0.3662, Test Loss: 0.3272, LR: 0.000010
Epoch [20/250], Train Loss: 0.3621, Test Loss: 0.3229, LR: 0.000010
Epoch [21/250], Train Loss: 0.3595, Test Loss: 0.3234, LR: 0.000010
Epoch [22/250], Train Loss: 0.3573, Test Loss: 0.3115, LR: 0.000010
Epoch [23/250], Train Loss: 0.3535, Test Loss: 0.3133, LR: 0.000010
Epoch [24/250], Train Loss: 0.3504, Test Loss: 0.3124, LR: 0.000010
Epoch [25/250], Train Loss: 0.3487, Test Loss: 0.3112, LR: 0.000010
Epoch [26/250], Train Loss: 0.3453, Test Loss: 0.3144, LR: 0.000010
Epoch [27/250], Train Loss: 0.3430, Test Loss: 0.3018, LR: 0.000010
Epoch [28/250], Train Loss: 0.3418, Test Loss: 0.3078, LR: 0.000010
Epoch [29/250], Train Loss: 0.3398, Test Loss: 0.2985, LR: 0.000010
Epoch [30/250], Train Loss: 0.3379, Test Loss: 0.3023, LR: 0.000010
Epoch [31/250], Train Loss: 0.3356, Test Loss: 0.2980, LR: 0.000010
Epoch [32/250], Train Loss: 0.3335, Test Loss: 0.2972, LR: 0.000010
Epoch [33/250], Train Loss: 0.3336, Test Loss: 0.2962, LR: 0.000010
Epoch [34/250], Train Loss: 0.3315, Test Loss: 0.3312, LR: 0.000010
Epoch [35/250], Train Loss: 0.3290, Test Loss: 0.2987, LR: 0.000010
Epoch [36/250], Train Loss: 0.3284, Test Loss: 0.2952, LR: 0.000010
Epoch [37/250], Train Loss: 0.3282, Test Loss: 0.2838, LR: 0.000010
Epoch [38/250], Train Loss: 0.3237, Test Loss: 0.2873, LR: 0.000010
Epoch [39/250], Train Loss: 0.3226, Test Loss: 0.2870, LR: 0.000010
Epoch [40/250], Train Loss: 0.3216, Test Loss: 0.2902, LR: 0.000010
Epoch [41/250], Train Loss: 0.3196, Test Loss: 0.2892, LR: 0.000010
Epoch [42/250], Train Loss: 0.3187, Test Loss: 0.2918, LR: 0.000010
Epoch [43/250], Train Loss: 0.3169, Test Loss: 0.2775, LR: 0.000010
Epoch [44/250], Train Loss: 0.3156, Test Loss: 0.2777, LR: 0.000010
Epoch [45/250], Train Loss: 0.3147, Test Loss: 0.2771, LR: 0.000010
Epoch [46/250], Train Loss: 0.3136, Test Loss: 0.2816, LR: 0.000010
Epoch [47/250], Train Loss: 0.3120, Test Loss: 0.2780, LR: 0.000010
Epoch [48/250], Train Loss: 0.3126, Test Loss: 0.2707, LR: 0.000010
Epoch [49/250], Train Loss: 0.3093, Test Loss: 0.2822, LR: 0.000010
Epoch [50/250], Train Loss: 0.3107, Test Loss: 0.2698, LR: 0.000010
Epoch [51/250], Train Loss: 0.3087, Test Loss: 0.2697, LR: 0.000010
Epoch [52/250], Train Loss: 0.3072, Test Loss: 0.2814, LR: 0.000010
Epoch [53/250], Train Loss: 0.3055, Test Loss: 0.2764, LR: 0.000010
Epoch [54/250], Train Loss: 0.3059, Test Loss: 0.2757, LR: 0.000010
Epoch [55/250], Train Loss: 0.3034, Test Loss: 0.2769, LR: 0.000010
Epoch [56/250], Train Loss: 0.3034, Test Loss: 0.2735, LR: 0.000010
Epoch [57/250], Train Loss: 0.3025, Test Loss: 0.2757, LR: 0.000010
Epoch [58/250], Train Loss: 0.3016, Test Loss: 0.2667, LR: 0.000010
Epoch [59/250], Train Loss: 0.3015, Test Loss: 0.2683, LR: 0.000010
Epoch [60/250], Train Loss: 0.2990, Test Loss: 0.2630, LR: 0.000010
Epoch [61/250], Train Loss: 0.2982, Test Loss: 0.2630, LR: 0.000010
Epoch [62/250], Train Loss: 0.2974, Test Loss: 0.2713, LR: 0.000010
Epoch [63/250], Train Loss: 0.2971, Test Loss: 0.2684, LR: 0.000010
Epoch [64/250], Train Loss: 0.2975, Test Loss: 0.2673, LR: 0.000010
Epoch [65/250], Train Loss: 0.2970, Test Loss: 0.2638, LR: 0.000010
Epoch [66/250], Train Loss: 0.2944, Test Loss: 0.2694, LR: 0.000010
Epoch [67/250], Train Loss: 0.2938, Test Loss: 0.2571, LR: 0.000010
Epoch [68/250], Train Loss: 0.2941, Test Loss: 0.2632, LR: 0.000010
Epoch [69/250], Train Loss: 0.2926, Test Loss: 0.2618, LR: 0.000010
Epoch [70/250], Train Loss: 0.2922, Test Loss: 0.2614, LR: 0.000010
Epoch [71/250], Train Loss: 0.2927, Test Loss: 0.2647, LR: 0.000010
Epoch [72/250], Train Loss: 0.2921, Test Loss: 0.2598, LR: 0.000010
Epoch [73/250], Train Loss: 0.2903, Test Loss: 0.2660, LR: 0.000010
Epoch [74/250], Train Loss: 0.2907, Test Loss: 0.2681, LR: 0.000010
Epoch [75/250], Train Loss: 0.2903, Test Loss: 0.2581, LR: 0.000010
Epoch [76/250], Train Loss: 0.2913, Test Loss: 0.2607, LR: 0.000010
Epoch [77/250], Train Loss: 0.2881, Test Loss: 0.2557, LR: 0.000010
Epoch [78/250], Train Loss: 0.2869, Test Loss: 0.2620, LR: 0.000010
Epoch [79/250], Train Loss: 0.2875, Test Loss: 0.2687, LR: 0.000010
Epoch [80/250], Train Loss: 0.2877, Test Loss: 0.2613, LR: 0.000010
Epoch [81/250], Train Loss: 0.2877, Test Loss: 0.2578, LR: 0.000010
Epoch [82/250], Train Loss: 0.2853, Test Loss: 0.2500, LR: 0.000010
Epoch [83/250], Train Loss: 0.2862, Test Loss: 0.2581, LR: 0.000010
Epoch [84/250], Train Loss: 0.2837, Test Loss: 0.2594, LR: 0.000010
Epoch [85/250], Train Loss: 0.2829, Test Loss: 0.2560, LR: 0.000010
Epoch [86/250], Train Loss: 0.2840, Test Loss: 0.2586, LR: 0.000010
Epoch [87/250], Train Loss: 0.2820, Test Loss: 0.2607, LR: 0.000010
Epoch [88/250], Train Loss: 0.2824, Test Loss: 0.2600, LR: 0.000010
Epoch [89/250], Train Loss: 0.2824, Test Loss: 0.2706, LR: 0.000010
Epoch [90/250], Train Loss: 0.2821, Test Loss: 0.2593, LR: 0.000010
Epoch [91/250], Train Loss: 0.2806, Test Loss: 0.2530, LR: 0.000010
Epoch [92/250], Train Loss: 0.2818, Test Loss: 0.2550, LR: 0.000010
Epoch [93/250], Train Loss: 0.2808, Test Loss: 0.2449, LR: 0.000010
Epoch [94/250], Train Loss: 0.2805, Test Loss: 0.2529, LR: 0.000010
Epoch [95/250], Train Loss: 0.2798, Test Loss: 0.2535, LR: 0.000010
Epoch [96/250], Train Loss: 0.2801, Test Loss: 0.2488, LR: 0.000010
Epoch [97/250], Train Loss: 0.2796, Test Loss: 0.2491, LR: 0.000010
Epoch [98/250], Train Loss: 0.2786, Test Loss: 0.2641, LR: 0.000010
Epoch [99/250], Train Loss: 0.2793, Test Loss: 0.2476, LR: 0.000010
Epoch [100/250], Train Loss: 0.2769, Test Loss: 0.2426, LR: 0.000010
Epoch [101/250], Train Loss: 0.2787, Test Loss: 0.2454, LR: 0.000010
Epoch [102/250], Train Loss: 0.2772, Test Loss: 0.2469, LR: 0.000010
Epoch [103/250], Train Loss: 0.2759, Test Loss: 0.2448, LR: 0.000010
Epoch [104/250], Train Loss: 0.2766, Test Loss: 0.2518, LR: 0.000010
Epoch [105/250], Train Loss: 0.2753, Test Loss: 0.2400, LR: 0.000010
Epoch [106/250], Train Loss: 0.2742, Test Loss: 0.2442, LR: 0.000010
Epoch [107/250], Train Loss: 0.2754, Test Loss: 0.2443, LR: 0.000010
Epoch [108/250], Train Loss: 0.2749, Test Loss: 0.2521, LR: 0.000010
Epoch [109/250], Train Loss: 0.2749, Test Loss: 0.2528, LR: 0.000010
Epoch [110/250], Train Loss: 0.2729, Test Loss: 0.2514, LR: 0.000010
Epoch [111/250], Train Loss: 0.2734, Test Loss: 0.2453, LR: 0.000010
Epoch [112/250], Train Loss: 0.2728, Test Loss: 0.2383, LR: 0.000010
Epoch [113/250], Train Loss: 0.2719, Test Loss: 0.2477, LR: 0.000010
Epoch [114/250], Train Loss: 0.2721, Test Loss: 0.2453, LR: 0.000010
Epoch [115/250], Train Loss: 0.2714, Test Loss: 0.2527, LR: 0.000010
Epoch [116/250], Train Loss: 0.2715, Test Loss: 0.2389, LR: 0.000010
Epoch [117/250], Train Loss: 0.2716, Test Loss: 0.2426, LR: 0.000010
Epoch [118/250], Train Loss: 0.2725, Test Loss: 0.2531, LR: 0.000010
Epoch [119/250], Train Loss: 0.2707, Test Loss: 0.2356, LR: 0.000010
Epoch [120/250], Train Loss: 0.2728, Test Loss: 0.2451, LR: 0.000010
Epoch [121/250], Train Loss: 0.2714, Test Loss: 0.2481, LR: 0.000010
Epoch [122/250], Train Loss: 0.2702, Test Loss: 0.2386, LR: 0.000010
Epoch [123/250], Train Loss: 0.2699, Test Loss: 0.2429, LR: 0.000010
Epoch [124/250], Train Loss: 0.2693, Test Loss: 0.2478, LR: 0.000010
Epoch [125/250], Train Loss: 0.2703, Test Loss: 0.2472, LR: 0.000010
Epoch [126/250], Train Loss: 0.2687, Test Loss: 0.2426, LR: 0.000010
Epoch [127/250], Train Loss: 0.2693, Test Loss: 0.2362, LR: 0.000010
Epoch [128/250], Train Loss: 0.2684, Test Loss: 0.2529, LR: 0.000010
Epoch [129/250], Train Loss: 0.2670, Test Loss: 0.2463, LR: 0.000010
Epoch [130/250], Train Loss: 0.2686, Test Loss: 0.2364, LR: 0.000010
Epoch [131/250], Train Loss: 0.2673, Test Loss: 0.2378, LR: 0.000010
Epoch [132/250], Train Loss: 0.2670, Test Loss: 0.2402, LR: 0.000010
Epoch [133/250], Train Loss: 0.2695, Test Loss: 0.2476, LR: 0.000010
Epoch [134/250], Train Loss: 0.2663, Test Loss: 0.2379, LR: 0.000010
Epoch [135/250], Train Loss: 0.2690, Test Loss: 0.2382, LR: 0.000010
Epoch [136/250], Train Loss: 0.2681, Test Loss: 0.2428, LR: 0.000010
Epoch [137/250], Train Loss: 0.2661, Test Loss: 0.2472, LR: 0.000010
Epoch [138/250], Train Loss: 0.2674, Test Loss: 0.2410, LR: 0.000010
Epoch [139/250], Train Loss: 0.2660, Test Loss: 0.2391, LR: 0.000010
Epoch [140/250], Train Loss: 0.2667, Test Loss: 0.2304, LR: 0.000010
Epoch [141/250], Train Loss: 0.2661, Test Loss: 0.2547, LR: 0.000010
Epoch [142/250], Train Loss: 0.2649, Test Loss: 0.2442, LR: 0.000010
Epoch [143/250], Train Loss: 0.2668, Test Loss: 0.2369, LR: 0.000010
Epoch [144/250], Train Loss: 0.2653, Test Loss: 0.2371, LR: 0.000010
Epoch [145/250], Train Loss: 0.2643, Test Loss: 0.2376, LR: 0.000010
Epoch [146/250], Train Loss: 0.2638, Test Loss: 0.2384, LR: 0.000010
Epoch [147/250], Train Loss: 0.2651, Test Loss: 0.2485, LR: 0.000010
Epoch [148/250], Train Loss: 0.2632, Test Loss: 0.2292, LR: 0.000010
Epoch [149/250], Train Loss: 0.2640, Test Loss: 0.2418, LR: 0.000010
Epoch [150/250], Train Loss: 0.2642, Test Loss: 0.2519, LR: 0.000010
Epoch [151/250], Train Loss: 0.2628, Test Loss: 0.2302, LR: 0.000010
Epoch [152/250], Train Loss: 0.2632, Test Loss: 0.2518, LR: 0.000010
Epoch [153/250], Train Loss: 0.2627, Test Loss: 0.2313, LR: 0.000010
Epoch [154/250], Train Loss: 0.2629, Test Loss: 0.2322, LR: 0.000010
Epoch [155/250], Train Loss: 0.2619, Test Loss: 0.2387, LR: 0.000010
Epoch [156/250], Train Loss: 0.2630, Test Loss: 0.2315, LR: 0.000010
Epoch [157/250], Train Loss: 0.2616, Test Loss: 0.2288, LR: 0.000010
Epoch [158/250], Train Loss: 0.2642, Test Loss: 0.2340, LR: 0.000010
Epoch [159/250], Train Loss: 0.2623, Test Loss: 0.2284, LR: 0.000010
Epoch [160/250], Train Loss: 0.2623, Test Loss: 0.2406, LR: 0.000010
Epoch [161/250], Train Loss: 0.2615, Test Loss: 0.2406, LR: 0.000010
Epoch [162/250], Train Loss: 0.2636, Test Loss: 0.2358, LR: 0.000010
Epoch [163/250], Train Loss: 0.2612, Test Loss: 0.2242, LR: 0.000010
Epoch [164/250], Train Loss: 0.2615, Test Loss: 0.2364, LR: 0.000010
Epoch [165/250], Train Loss: 0.2615, Test Loss: 0.2336, LR: 0.000010
Epoch [166/250], Train Loss: 0.2592, Test Loss: 0.2331, LR: 0.000010
Epoch [167/250], Train Loss: 0.2598, Test Loss: 0.2309, LR: 0.000010
Epoch [168/250], Train Loss: 0.2602, Test Loss: 0.2482, LR: 0.000010
Epoch [169/250], Train Loss: 0.2605, Test Loss: 0.2315, LR: 0.000010
Epoch [170/250], Train Loss: 0.2611, Test Loss: 0.2299, LR: 0.000010
Epoch [171/250], Train Loss: 0.2597, Test Loss: 0.2387, LR: 0.000010
Epoch [172/250], Train Loss: 0.2587, Test Loss: 0.2381, LR: 0.000010
Epoch [173/250], Train Loss: 0.2594, Test Loss: 0.2341, LR: 0.000010
Epoch [174/250], Train Loss: 0.2600, Test Loss: 0.2283, LR: 0.000010
Epoch [175/250], Train Loss: 0.2597, Test Loss: 0.2394, LR: 0.000010
Epoch [176/250], Train Loss: 0.2590, Test Loss: 0.2284, LR: 0.000010
Epoch [177/250], Train Loss: 0.2608, Test Loss: 0.2243, LR: 0.000010
Epoch [178/250], Train Loss: 0.2590, Test Loss: 0.2357, LR: 0.000010
Epoch [179/250], Train Loss: 0.2588, Test Loss: 0.2364, LR: 0.000010
Epoch [180/250], Train Loss: 0.2588, Test Loss: 0.2422, LR: 0.000010
Epoch [181/250], Train Loss: 0.2599, Test Loss: 0.2389, LR: 0.000010
Epoch [182/250], Train Loss: 0.2588, Test Loss: 0.2221, LR: 0.000010
Epoch [183/250], Train Loss: 0.2573, Test Loss: 0.2378, LR: 0.000010
Epoch [184/250], Train Loss: 0.2570, Test Loss: 0.2338, LR: 0.000010
Epoch [185/250], Train Loss: 0.2576, Test Loss: 0.2366, LR: 0.000010
Epoch [186/250], Train Loss: 0.2585, Test Loss: 0.2385, LR: 0.000010
Epoch [187/250], Train Loss: 0.2563, Test Loss: 0.2342, LR: 0.000010
Epoch [188/250], Train Loss: 0.2573, Test Loss: 0.2271, LR: 0.000010
Epoch [189/250], Train Loss: 0.2560, Test Loss: 0.2330, LR: 0.000010
Epoch [190/250], Train Loss: 0.2572, Test Loss: 0.2269, LR: 0.000010
Epoch [191/250], Train Loss: 0.2571, Test Loss: 0.2352, LR: 0.000010
Epoch [192/250], Train Loss: 0.2574, Test Loss: 0.2471, LR: 0.000010
Epoch [193/250], Train Loss: 0.2569, Test Loss: 0.2271, LR: 0.000010
Epoch [194/250], Train Loss: 0.2581, Test Loss: 0.2260, LR: 0.000010
Epoch [195/250], Train Loss: 0.2545, Test Loss: 0.2294, LR: 0.000010
Epoch [196/250], Train Loss: 0.2558, Test Loss: 0.2320, LR: 0.000010
Epoch [197/250], Train Loss: 0.2566, Test Loss: 0.2283, LR: 0.000010
Epoch [198/250], Train Loss: 0.2562, Test Loss: 0.2263, LR: 0.000010
Epoch [199/250], Train Loss: 0.2561, Test Loss: 0.2318, LR: 0.000010
Epoch [200/250], Train Loss: 0.2563, Test Loss: 0.2382, LR: 0.000010
Epoch [201/250], Train Loss: 0.2563, Test Loss: 0.2424, LR: 0.000010
Epoch [202/250], Train Loss: 0.2554, Test Loss: 0.2291, LR: 0.000010
Epoch [203/250], Train Loss: 0.2562, Test Loss: 0.2325, LR: 0.000005
Epoch [204/250], Train Loss: 0.2555, Test Loss: 0.2365, LR: 0.000005
Epoch [205/250], Train Loss: 0.2554, Test Loss: 0.2318, LR: 0.000005
Epoch [206/250], Train Loss: 0.2560, Test Loss: 0.2231, LR: 0.000005
Epoch [207/250], Train Loss: 0.2538, Test Loss: 0.2339, LR: 0.000005
Epoch [208/250], Train Loss: 0.2570, Test Loss: 0.2361, LR: 0.000005
Epoch [209/250], Train Loss: 0.2547, Test Loss: 0.2378, LR: 0.000005
Epoch [210/250], Train Loss: 0.2542, Test Loss: 0.2318, LR: 0.000005
Epoch [211/250], Train Loss: 0.2547, Test Loss: 0.2338, LR: 0.000005
Epoch [212/250], Train Loss: 0.2541, Test Loss: 0.2402, LR: 0.000005
Epoch [213/250], Train Loss: 0.2552, Test Loss: 0.2254, LR: 0.000005
Epoch [214/250], Train Loss: 0.2549, Test Loss: 0.2297, LR: 0.000005
Epoch [215/250], Train Loss: 0.2532, Test Loss: 0.2330, LR: 0.000005
Epoch [216/250], Train Loss: 0.2540, Test Loss: 0.2274, LR: 0.000005
Epoch [217/250], Train Loss: 0.2542, Test Loss: 0.2317, LR: 0.000005
Epoch [218/250], Train Loss: 0.2534, Test Loss: 0.2246, LR: 0.000005
Epoch [219/250], Train Loss: 0.2534, Test Loss: 0.2256, LR: 0.000005
Epoch [220/250], Train Loss: 0.2533, Test Loss: 0.2386, LR: 0.000005
Epoch [221/250], Train Loss: 0.2544, Test Loss: 0.2220, LR: 0.000005
Epoch [222/250], Train Loss: 0.2538, Test Loss: 0.2375, LR: 0.000005
Early stopping en época 222
Entrenamiento de la Red Neuronal completado.
Tiempo de entrenamiento de la Red Neuronal: 159.63 segundos
Mejor pérdida de validación: 0.2221
In [39]
# Guardamos el modelo para su reproducibilidad

# processor (fitted on X_train)
preprocessor_filename_nn = models_path + '/nn_preprocessor.joblib'
joblib.dump(preprocessor, preprocessor_filename_nn)
print(f"Preprocesador de NN guardado en: {preprocessor_filename_nn}")

# y_scaler (fitted on y_train)
y_scaler_filename_nn = models_path + '/nn_y_scaler.joblib'
joblib.dump(y_scaler, y_scaler_filename_nn)
print(f"Escalador de targets de NN guardado en: {y_scaler_filename_nn}")

# Parámetros aprendidos por el modelo de Red Neuronal.
model_state_filename_nn = models_path + '/nn_model_state.pth'
torch.save(model.state_dict(), model_state_filename_nn)
print(f"Estado del modelo NN guardado en: {model_state_filename_nn}")
Preprocesador de NN guardado en: ../Modelo_Entrenado//nn_preprocessor.joblib
Escalador de targets de NN guardado en: ../Modelo_Entrenado//nn_y_scaler.joblib
Estado del modelo NN guardado en: ../Modelo_Entrenado//nn_model_state.pth
In [40]
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss', color=naturgy_orange_palette[0], linewidth=2)
plt.plot(test_losses, label='Test/Validation Loss', color=naturgy_blue_palette[0], linewidth=2)
plt.title('Curvas de Aprendizaje de la Red Neuronal', fontsize=14, color=naturgy_blue_palette[0], fontweight='bold')
plt.xlabel('Epochs', fontsize=12, color=naturgy_blue_palette[1])
plt.ylabel('Loss (MSE)', fontsize=12, color=naturgy_blue_palette[1])
plt.grid(True, alpha=0.3, linestyle='--')
plt.legend(fontsize=10)

best_epoch = test_losses.index(min(test_losses))
plt.axvline(x=best_epoch, color=naturgy_blue_palette[2], linestyle='--', 
            label=f'Mejor modelo (época {best_epoch+1})')

plt.tight_layout()
plt.show()

La gráfica se ve bastante bien, tanto training como test bajan de forma rápida indicando que el modelo está generalizado bien.

Probé con otro tipo de redes neuronales más específicos para series temporales, pero o por falta de lags, rollings etc o porque no preparo bien los datos los resultados no mejoran, eso sí TSLM da gráficas de loss vs epoch bastante estables.

Funciones útiles para cargar y visualizar modelos

In [41]
def evaluate_model(y_true, y_pred, model_name):
    results = {
        target: {
            'R2': r2_score(y_true[target], y_pred[:, i]),
            'RMSE': np.sqrt(mean_squared_error(y_true[target], y_pred[:, i])),
            'MAE': mean_absolute_error(y_true[target], y_pred[:, i]),
            'sMAPE': np.mean(np.abs((y_true[target] - y_pred[:, i]) / (y_true[target] + 1e-9))) * 100
        }
        for i, target in enumerate(y_true.columns)
    }
    df_results = pd.DataFrame(results).T
    print(f"--- Resultados para {model_name} ---")
    display(df_results)
    return df_results
In [42]
def predict_with_neural_network(nn_model, X_data, nn_X_preprocessor, nn_y_scaler):
    """
    Realiza predicciones usando un modelo de red neuronal PyTorch,
    incluyendo preprocesamiento de X y escalado inverso de y.
    """
    if nn_model is None or nn_X_preprocessor is None or nn_y_scaler is None:
        print("Error: El modelo NN o sus componentes no están disponibles para la predicción.")
        return None

    # 1. Preprocesar X_data con el preprocesador de la NN
    X_data_scaled = nn_X_preprocessor.transform(X_data)
    
    # 2. Realizar predicciones con el modelo NN
    nn_model.eval()
    with torch.no_grad():
        y_pred_scaled = nn_model(torch.tensor(X_data_scaled, dtype=torch.float32)).numpy()
        
    # 3. Revertir el escalado de las predicciones
    y_pred_original_scale = nn_y_scaler.inverse_transform(y_pred_scaled)
    
    return y_pred_original_scale
In [43]
def generate_comparison_plots(y_true_df, y_pred_xgb_data, y_pred_nn_data, dataset_name):
    """
    Genera y muestra gráficos de comparación para valores reales vs. predichos.
    """
    y_pred_xgb_df = None
    if y_pred_xgb_data is not None:
        y_pred_xgb_df = pd.DataFrame(y_pred_xgb_data, columns=y_true_df.columns, index=y_true_df.index)

    y_pred_nn_df = None
    if y_pred_nn_data is not None:
        y_pred_nn_df = pd.DataFrame(y_pred_nn_data, columns=y_true_df.columns, index=y_true_df.index)

    num_targets = len(y_true_df.columns)
    fig_height = max(15, num_targets * 5)
    fig, axes = plt.subplots(num_targets, 1, figsize=(18, fig_height), sharex=True)
    
    if num_targets == 1:
        axes = [axes]

    for i, target_col in enumerate(y_true_df.columns):
        ax = axes[i]
        y_true_df[target_col].plot(ax=ax, label=f'Valor Real ({dataset_name})', color='black', alpha=0.7)
        if y_pred_xgb_df is not None:
            y_pred_xgb_df[target_col].plot(ax=ax, label=f'Pred. XGBoost ({dataset_name})', 
                                          color=naturgy_blue_palette[0], linestyle='--')
        if y_pred_nn_df is not None:
            y_pred_nn_df[target_col].plot(ax=ax, label=f'Pred. PyTorch ({dataset_name})', 
                                         color=naturgy_orange_palette[0], linestyle=':')
        
        ax.set_title(f'Comparación de Predicciones para {target_col} ({dataset_name})', 
                    fontsize=12, color=naturgy_blue_palette[0], fontweight='bold')
        ax.legend(fontsize=10)
        ax.grid(True, alpha=0.3, linestyle='--')
        ax.set_ylabel('Valor', fontsize=10, color=naturgy_blue_palette[1])
        
    plt.tight_layout()
    plt.show()
In [44]
def load_xgb_pipeline_model(model_path):
    """Carga un pipeline de XGBoost guardado con joblib."""
    try:
        pipeline = joblib.load(model_path)
        print(f"Pipeline XGBoost cargado desde: {model_path}")
        return pipeline
    except FileNotFoundError:
        print(f"ERROR: No se encontró el archivo del pipeline XGBoost: {model_path}")
        return None

Resultados (en Conjunto de Validación)

In [45]
y_pred_xgb_val = xgb_pipeline.predict(X_val) 
y_pred_nn_val = predict_with_neural_network(model, X_val, preprocessor, y_scaler)

# --- Evaluación de Modelos en el CONJUNTO DE VALIDACIÓN ---
if y_pred_xgb_val is not None:
    results_xgb_val = evaluate_model(y_val, y_pred_xgb_val, 'XGBoost (Validación)')
if y_pred_nn_val is not None:
    results_nn_val = evaluate_model(y_val, y_pred_nn_val, 'Red Neuronal (PyTorch) (Validación)')

print("\n--- Comparativa de Tiempos de Entrenamiento ---")
training_times_data = {
    'Modelo': ['XGBoost', 'Red Neuronal (PyTorch)'],
    'Tiempo de Entrenamiento (s)': [training_time_xgb, training_time_nn]
}

df_training_times = pd.DataFrame(training_times_data)
display(df_training_times)

print("\n--- Gráficos de Comparación en el Conjunto de Validación ---")
generate_comparison_plots(y_val, y_pred_xgb_val, y_pred_nn_val, "Validación")
--- Resultados para XGBoost (Validación) ---
R2 RMSE MAE sMAPE
renewable_H_plus_6 0.796624 1867.445213 1362.814661 12.291641
load_H_plus_6 0.897553 1440.722241 1004.700937 3.495286
renewable_H_plus_1 0.969492 723.183029 434.534922 3.960992
load_H_plus_1 0.986647 519.790706 313.268580 1.093828
price_H_plus_1 0.960352 2.459631 1.810902 3.101360
--- Resultados para Red Neuronal (PyTorch) (Validación) ---
R2 RMSE MAE sMAPE
renewable_H_plus_6 0.798433 1859.120601 1357.319815 12.387131
load_H_plus_6 0.810376 1960.096825 1481.159004 5.060964
renewable_H_plus_1 0.956510 863.453177 590.560515 5.389653
load_H_plus_1 0.962651 869.304349 648.637170 2.283899
price_H_plus_1 0.939531 3.037564 2.335380 4.037698
--- Comparativa de Tiempos de Entrenamiento ---
Modelo Tiempo de Entrenamiento (s)
0 XGBoost 177.591160
1 Red Neuronal (PyTorch) 159.627548
--- Gráficos de Comparación en el Conjunto de Validación ---

Conclusiones de los modelos

  • Los resultados para el Horizonte de predicciones de H+1 son más precisos que los calculados para H+6.
  • Por lo general XGBoost tiene mejor rendimiento en casi todos los apartados R2 superior etc.
  • La variable más difícil de calcular para ambos modelos es la Generación de Renovables + 6H posiblemente porque depende mucho de factores difíciles de calcular como puede ser la generación eólica y solar.
  • El tiempo de entrenamiento es similar en ambos modelos.

En general el modelo más robusto con los datos de validación es XGBoost

8 Prueba Final

Carga de Modelos y Preprocesadores

In [46]
def load_neural_network_components(preprocessor_path, y_scaler_path, model_state_path, 
                                   ModelClass, input_size, output_size):
    """
    Carga un modelo de Red Neuronal PyTorch y sus componentes asociados.
    ModelClass es la clase del modelo (ej. MLP).
    input_size y output_size son las dimensiones necesarias para instanciar ModelClass.
    """
    
    loaded_preprocessor = joblib.load(preprocessor_path)
    print(f"Preprocesador de NN cargado desde: {preprocessor_path}")
    
    loaded_y_scaler = joblib.load(y_scaler_path)
    print(f"Escalador de targets de NN cargado desde: {y_scaler_path}")
    
    loaded_model = ModelClass(input_size, output_size)
    loaded_model.load_state_dict(torch.load(model_state_path))
    loaded_model.eval()
    print(f"Estado del modelo NN cargado desde: {model_state_path}")
        
    return loaded_model, loaded_preprocessor, loaded_y_scaler
In [47]
print("\n--- Cargando Modelos y Preprocesadores Guardados ---")
model_filename_xgb = models_path + '/xgb_model-energy_weather.joblib'
loaded_xgb_pipeline = load_xgb_pipeline_model(model_filename_xgb)

preprocessor_filename_nn = models_path + '/nn_preprocessor.joblib'
y_scaler_filename_nn = models_path + '/nn_y_scaler.joblib'
model_state_filename_nn = models_path + '/nn_model_state.pth'
--- Cargando Modelos y Preprocesadores Guardados ---
Pipeline XGBoost cargado desde: ../Modelo_Entrenado//xgb_model-energy_weather.joblib
In [48]
try:
    numeric_features_list = X_train.select_dtypes(include=np.number).columns.tolist() 
    actual_input_size_nn = len(numeric_features_list)
    actual_output_size_nn = y_train.shape[1]
    print(f"Determinadas dimensiones para MLP: input={actual_input_size_nn}, output={actual_output_size_nn}")
except NameError:
    print("ADVERTENCIA: X_train o y_train no están definidos. Debes establecer actual_input_size_nn y actual_output_size_nn manualmente.")
    actual_input_size_nn = 0 
    actual_output_size_nn = 0


loaded_nn_model, loaded_nn_preprocessor, loaded_nn_y_scaler = load_neural_network_components(
    preprocessor_filename_nn, 
    y_scaler_filename_nn, 
    model_state_filename_nn,
    MLP,
    actual_input_size_nn, 
    actual_output_size_nn 
)
Determinadas dimensiones para MLP: input=181, output=5
Preprocesador de NN cargado desde: ../Modelo_Entrenado//nn_preprocessor.joblib
Escalador de targets de NN cargado desde: ../Modelo_Entrenado//nn_y_scaler.joblib
Estado del modelo NN cargado desde: ../Modelo_Entrenado//nn_model_state.pth

Predicciones y Evaluación en el Conjunto de Prueba Final

In [49]
print("\n--- Realizando predicciones en el CONJUNTO DE PRUEBA FINAL ---")

y_pred_xgb_final = loaded_xgb_pipeline.predict(X_final_test)

y_pred_nn_final = predict_with_neural_network(
    loaded_nn_model, X_final_test, loaded_nn_preprocessor, loaded_nn_y_scaler
)
results_xgb_final = evaluate_model(y_final_test, y_pred_xgb_final, 'XGBoost - Prueba Final')
results_nn_final = evaluate_model(y_final_test, y_pred_nn_final, 'Red Neuronal - Prueba Final')

print("\n--- Gráficos de Comparación en el Conjunto de Prueba Final ---")
generate_comparison_plots(y_final_test, y_pred_xgb_final, y_pred_nn_final, "Prueba Final")

print("\n--- Prueba Final Completada ---")
--- Realizando predicciones en el CONJUNTO DE PRUEBA FINAL ---
--- Resultados para XGBoost - Prueba Final ---
R2 RMSE MAE sMAPE
renewable_H_plus_6 0.823469 1484.932725 1186.277652 11.958129
load_H_plus_6 0.905611 1511.633135 1121.651317 4.124332
renewable_H_plus_1 0.974325 565.720860 402.244674 4.114768
load_H_plus_1 0.988410 533.791674 336.940117 1.280638
price_H_plus_1 0.933953 1.836295 1.400207 2.104182
--- Resultados para Red Neuronal - Prueba Final ---
R2 RMSE MAE sMAPE
renewable_H_plus_6 0.847161 1381.698408 1089.494666 11.027053
load_H_plus_6 0.846812 1925.743498 1428.098706 5.109202
renewable_H_plus_1 0.964282 667.256410 499.152064 5.006259
load_H_plus_1 0.974265 795.417798 610.786716 2.239124
price_H_plus_1 0.897759 2.284692 1.781431 2.662419
--- Gráficos de Comparación en el Conjunto de Prueba Final ---
--- Prueba Final Completada ---

Conclusiones Finales del Rendimiento de los Modelos

  • XGBoost definitivamente es el modelo más completo para nuestros objetivos. Probablemente su mejor rendimiento se debe a que a diferencia de la red neuronal no necesita una ingeniería tan refinada como la que podría necesitar una rede neuronal. A parte XGBoost es buena trantando datos tabulares estructurados.

  • La red neuronal destaca por ser más preciso en las renovables + 6h probablemente a que encuentra patrones que XGBoost no puede ya que es la variable más dificl de calcular.

  • Aunque el modelo tiene unos resultados que se podrían llevar a producción, a lo largo del tiempo estos modelos se podría refinar a medida que se van alimentando con más datos.