Naturgy — EDA & Forecast Models
Exploratory analysis, feature engineering, and a head-to-head between XGBoost and a Keras NN on Spanish electricity (2015–2018).

Análisis Exploratorio de Datos (EDA) y Modelado Predictivo para Naturgy: Impulsando el Futuro Energético
Índice del Notebook
Este notebook detalla el desarrollo y la validación de modelos predictivos avanzados diseñados para proporcionar a Naturgy información crítica sobre tres pilares fundamentales del sistema energético: la demanda total de energía (total load actual), el precio del mercado eléctrico (price actual) y la generación de energía renovable (total renewable generation).
Hemos logrado construir modelos capaces de generar pronósticos fiables para estas variables con horizontes de 1 a 6 horas de antelación. Esta capacidad predictiva se traduce directamente en ventajas competitivas y operativas significativas para Naturgy:
Optimización de la Demanda y Generación: Al predecir la
total load actualcon varias horas de antelación, Naturgy puede optimizar el despacho de sus activos de generación, gestionar eficientemente las reservas energéticas y mejorar la estabilidad de la red, resultando en una operación más económica y segura.Estrategia de Mercado y Gestión de Riesgos: Con pronósticos precisos del
price actual, Naturgy puede afinar sus estrategias de participación en el mercado mayorista, optimizar la compra y venta de energía, y gestionar de manera más efectiva los riesgos asociados a la volatilidad de los precios, beneficiando tanto sus operaciones como las ofertas a sus clientes.Integración Eficiente de Energías Renovables: La predicción de la
total renewable generationpermite a Naturgy anticipar la producción de sus fuentes renovables y de las del sistema en general. Esto es clave para gestionar la intermitencia, optimizar el uso de sistemas de respaldo y almacenamiento, y facilitar una integración más fluida y rentable de las energías limpias en el mix energético, alineándose con los objetivos de sostenibilidad.
1 Exploración Inicial y Limpieza de Datos
Librerías
import time
import warnings
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from torch.utils.data import TensorDataset, DataLoader
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (18, 8)Paleta de colores
naturgy_orange_palette = ['#e37404', '#efaa66', '#f0b983']
naturgy_blue_palette = ['#044474', '#628baa', '#9bb3c4']Carga y Análisis Individual de Datasets
Dataset de Energía Mix Energético Español
df_energy = pd.read_parquet('../data/energy_silver.parquet/')
print(f"Dataset de Energía cargado con dimensiones: {df_energy.shape}")Dataset de Energía cargado con dimensiones: (35046, 29)
display(df_energy.head())
display(df_energy.info())
display(df_energy.describe())| time | generation biomass | generation fossil brown coal or lignite | generation fossil coal-derived gas | generation fossil gas | generation fossil hard coal | generation fossil oil | generation fossil oil shale | generation fossil peat | generation geothermal | ... | generation waste | generation wind offshore | generation wind onshore | forecast solar day ahead | forecast wind offshore eday ahead | forecast wind onshore day ahead | total load forecast | total load actual | price day ahead | price actual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-02-26 13:00:00 | 472.0 | 0.0 | 0.0 | 3524.0 | 2492.0 | 361.0 | 0.0 | 0.0 | 0.0 | ... | 202.0 | 0.0 | 10524.0 | 3599.0 | Unknown | 10387.0 | 33607.0 | 33320.0 | 41.10 | 53.80 |
| 1 | 2015-04-30 07:00:00 | 382.0 | 647.0 | 0.0 | 4907.0 | 6307.0 | 363.0 | 0.0 | 0.0 | 0.0 | ... | 185.0 | 0.0 | 4478.0 | 2132.0 | Unknown | 4514.0 | 30485.0 | 30581.0 | 59.99 | 71.36 |
| 2 | 2015-06-13 20:00:00 | 486.0 | 980.0 | 0.0 | 4444.0 | 6650.0 | 249.0 | 0.0 | 0.0 | 0.0 | ... | 280.0 | 0.0 | 3743.0 | 396.0 | Unknown | 3981.0 | 27127.0 | 26635.0 | 62.13 | 72.12 |
| 3 | 2015-06-23 19:00:00 | 496.0 | 538.0 | 0.0 | 4818.0 | 7598.0 | 369.0 | 0.0 | 0.0 | 0.0 | ... | 268.0 | 0.0 | 5692.0 | 1082.0 | Unknown | 5858.0 | 30816.0 | 29731.0 | 62.21 | 75.64 |
| 4 | 2015-06-30 23:00:00 | 497.0 | 936.0 | 0.0 | 5253.0 | 6619.0 | 274.0 | 0.0 | 0.0 | 0.0 | ... | 237.0 | 0.0 | 1971.0 | 2.0 | Unknown | 2174.0 | 22428.0 | 23341.0 | 49.67 | 60.52 |
5 rows × 29 columns
<class 'pandas.core.frame.DataFrame'> RangeIndex: 35046 entries, 0 to 35045 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 time 35046 non-null datetime64[ns] 1 generation biomass 35046 non-null float64 2 generation fossil brown coal or lignite 35046 non-null float64 3 generation fossil coal-derived gas 35046 non-null float64 4 generation fossil gas 35046 non-null float64 5 generation fossil hard coal 35046 non-null float64 6 generation fossil oil 35046 non-null float64 7 generation fossil oil shale 35046 non-null float64 8 generation fossil peat 35046 non-null float64 9 generation geothermal 35046 non-null float64 10 generation hydro pumped storage aggregated 35046 non-null object 11 generation hydro pumped storage consumption 35046 non-null float64 12 generation hydro run-of-river and poundage 35046 non-null float64 13 generation hydro water reservoir 35046 non-null float64 14 generation marine 35046 non-null float64 15 generation nuclear 35046 non-null float64 16 generation other 35046 non-null float64 17 generation other renewable 35046 non-null float64 18 generation solar 35046 non-null float64 19 generation waste 35046 non-null float64 20 generation wind offshore 35046 non-null float64 21 generation wind onshore 35046 non-null float64 22 forecast solar day ahead 35046 non-null float64 23 forecast wind offshore eday ahead 35046 non-null object 24 forecast wind onshore day ahead 35046 non-null float64 25 total load forecast 35046 non-null float64 26 total load actual 35046 non-null float64 27 price day ahead 35046 non-null float64 28 price actual 35046 non-null float64 dtypes: datetime64[ns](1), float64(26), object(2) memory usage: 7.8+ MB
None
| time | generation biomass | generation fossil brown coal or lignite | generation fossil coal-derived gas | generation fossil gas | generation fossil hard coal | generation fossil oil | generation fossil oil shale | generation fossil peat | generation geothermal | ... | generation solar | generation waste | generation wind offshore | generation wind onshore | forecast solar day ahead | forecast wind onshore day ahead | total load forecast | total load actual | price day ahead | price actual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 35046 | 35046.000000 | 35046.000000 | 35046.0 | 35046.000000 | 35046.000000 | 35046.000000 | 35046.0 | 35046.0 | 35046.0 | ... | 35046.000000 | 35046.000000 | 35046.0 | 35046.000000 | 35046.000000 | 35046.000000 | 35046.000000 | 35046.000000 | 35046.000000 | 35046.000000 |
| mean | 2016-12-31 17:28:37.822290688 | 383.513011 | 448.059208 | 0.0 | 5622.737488 | 4256.065742 | 298.319808 | 0.0 | 0.0 | 0.0 | ... | 1432.665925 | 269.452377 | 0.0 | 5464.479769 | 1438.787337 | 5470.675056 | 28711.264281 | 28696.326628 | 49.873031 | 57.875419 |
| min | 2014-12-31 23:00:00 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 237.000000 | 18105.000000 | 18041.000000 | 2.060000 | 9.330000 |
| 25% | 2016-01-01 16:15:00 | 333.000000 | 0.000000 | 0.0 | 4126.000000 | 2527.000000 | 263.000000 | 0.0 | 0.0 | 0.0 | ... | 71.000000 | 240.000000 | 0.0 | 2933.000000 | 69.000000 | 2979.000000 | 24795.000000 | 24809.250000 | 41.490000 | 49.340000 |
| 50% | 2016-12-31 18:30:00 | 367.000000 | 509.000000 | 0.0 | 4969.000000 | 4474.000000 | 300.000000 | 0.0 | 0.0 | 0.0 | ... | 616.000000 | 279.000000 | 0.0 | 4849.000000 | 576.000000 | 4853.000000 | 28905.000000 | 28896.000000 | 50.520000 | 58.010000 |
| 75% | 2017-12-31 19:45:00 | 433.000000 | 757.000000 | 0.0 | 6429.000000 | 5838.750000 | 330.000000 | 0.0 | 0.0 | 0.0 | ... | 2578.000000 | 310.000000 | 0.0 | 7398.000000 | 2635.000000 | 7350.750000 | 32263.000000 | 32187.000000 | 60.530000 | 68.000000 |
| max | 2018-12-31 22:00:00 | 592.000000 | 999.000000 | 0.0 | 20034.000000 | 8359.000000 | 449.000000 | 0.0 | 0.0 | 0.0 | ... | 5792.000000 | 357.000000 | 0.0 | 17436.000000 | 5836.000000 | 17430.000000 | 41390.000000 | 41015.000000 | 101.990000 | 116.800000 |
| std | NaN | 85.352783 | 354.568590 | 0.0 | 2201.830478 | 1961.601013 | 52.519923 | 0.0 | 0.0 | 0.0 | ... | 1680.119887 | 50.194840 | 0.0 | 3213.691587 | 1677.668887 | 3176.376641 | 4592.465001 | 4572.643285 | 14.617608 | 14.201153 |
8 rows × 27 columns
df_energy.isnull().sum()time 0 generation biomass 0 generation fossil brown coal or lignite 0 generation fossil coal-derived gas 0 generation fossil gas 0 generation fossil hard coal 0 generation fossil oil 0 generation fossil oil shale 0 generation fossil peat 0 generation geothermal 0 generation hydro pumped storage aggregated 0 generation hydro pumped storage consumption 0 generation hydro run-of-river and poundage 0 generation hydro water reservoir 0 generation marine 0 generation nuclear 0 generation other 0 generation other renewable 0 generation solar 0 generation waste 0 generation wind offshore 0 generation wind onshore 0 forecast solar day ahead 0 forecast wind offshore eday ahead 0 forecast wind onshore day ahead 0 total load forecast 0 total load actual 0 price day ahead 0 price actual 0 dtype: int64
Observaciones: Las fuentes renovables son las que más variabilidad tiene.
Por otra parte hay fuentes que no tienen variabilidad lo que posiblemente pueden eliminarse.
Hay dos columnas de tipo object, teniendo en cuenta que es una tabla de mediciones probablemente se puedan eliminar o estén etiquetadas así por algo durante el proceso de ETL en la capa bronze.
No se aprecian valores nulos.
Dataset de Clima
df_weather = pd.read_parquet('../data/weather_features_silver.parquet')
print(f"Dataset de Clima cargado con dimensiones: {df_weather.shape}")Dataset de Clima cargado con dimensiones: (175271, 15)
display(df_weather.head())
display(df_weather.info())
display(df_weather.describe())| time | city_name | temp | temp_min | temp_max | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all | weather_main | weather_description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-16 04:00:00 | Valencia | 12.100 | 12.100 | 12.100 | 1009 | 61 | 21.6 | 242 | 0.0 | 0.0 | 0.0 | 76 | clouds | broken clouds |
| 1 | 2015-01-26 20:00:00 | Valencia | 3.748 | 3.748 | 3.748 | 1027 | 77 | 3.6 | 46 | 0.0 | 0.0 | 0.0 | 36 | clouds | scattered clouds |
| 2 | 2015-02-24 20:00:00 | Seville | 9.962 | 9.962 | 9.962 | 1039 | 62 | 14.4 | 349 | 0.0 | 0.0 | 0.0 | 0 | clear | sky is clear |
| 3 | 2015-03-01 00:00:00 | Madrid | 14.800 | 14.800 | 14.800 | 942 | 89 | 7.2 | 252 | 0.0 | 0.0 | 0.0 | 8 | clear | sky is clear |
| 4 | 2015-03-01 19:00:00 | Barcelona | 16.600 | 16.600 | 16.600 | 1015 | 90 | 7.2 | 225 | 0.0 | 0.0 | 0.0 | 0 | clear | sky is clear |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 175271 entries, 0 to 175270 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 time 175271 non-null datetime64[ns] 1 city_name 175271 non-null object 2 temp 175271 non-null float64 3 temp_min 175271 non-null float64 4 temp_max 175271 non-null float64 5 pressure 175271 non-null int32 6 humidity 175271 non-null int32 7 wind_speed 175271 non-null float64 8 wind_deg 175271 non-null int32 9 rain_1h 175271 non-null float64 10 rain_3h 175271 non-null float64 11 snow_3h 175271 non-null float64 12 clouds_all 175271 non-null int32 13 weather_main 175271 non-null object 14 weather_description 175271 non-null object dtypes: datetime64[ns](1), float64(7), int32(4), object(3) memory usage: 17.4+ MB
None
| time | temp | temp_min | temp_max | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 175271 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 |
| mean | 2016-12-31 14:55:28.938614784 | 16.559508 | 15.279950 | 18.024507 | 1016.215871 | 68.050339 | 8.889137 | 166.727097 | 0.068966 | 0.000386 | 0.004848 | 24.343052 |
| min | 2014-12-31 23:00:00 | -10.910000 | -10.910000 | -10.910000 | 918.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2016-01-01 12:00:00 | 10.681000 | 9.639750 | 11.777594 | 1013.000000 | 53.000000 | 3.600000 | 56.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2016-12-31 15:00:00 | 16.000000 | 15.000000 | 17.000000 | 1018.000000 | 72.000000 | 7.200000 | 178.000000 | 0.000000 | 0.000000 | 0.000000 | 16.000000 |
| 75% | 2017-12-31 19:00:00 | 22.090000 | 21.000000 | 24.000000 | 1022.000000 | 87.000000 | 14.400000 | 270.000000 | 0.000000 | 0.000000 | 0.000000 | 40.000000 |
| max | 2018-12-31 22:00:00 | 42.450000 | 42.000000 | 48.000000 | 1090.000000 | 100.000000 | 230.400000 | 360.000000 | 12.000000 | 2.315000 | 21.500000 | 100.000000 |
| std | NaN | 8.024963 | 7.948565 | 8.613761 | 12.499474 | 21.813633 | 7.457877 | 116.543511 | 0.380658 | 0.007349 | 0.224579 | 30.338188 |
df_weather.isnull().sum()time 0 city_name 0 temp 0 temp_min 0 temp_max 0 pressure 0 humidity 0 wind_speed 0 wind_deg 0 rain_1h 0 rain_3h 0 snow_3h 0 clouds_all 0 weather_main 0 weather_description 0 dtype: int64
Observaciones: Podemos observar que hay datos con muy poca variabilidad sobretedo en rain y snow. Tal vez la mejor opción sea dejar solo el que sea más predictor para nuestros targets.
Por otra parte hay que comprobar las columnas temp ya que me da la sensación de que no varían tanto debido a que es por horas, entonces temp max y min pueden no ser relevantes.
3. Unión y Creación del Dataset Base
Como el dataset de energy es a nivel nacional, intentaremos hacer la media de de las condiciones del clima, que es por ciudad. Aunque perdamos algo de información, al calcular a nivel nacional no nos afectará tanto.
PROBLEMA CON LA UNIDAD DEL VIENTO
La naturaleza del viento es vectorial por lo que intar sumarlas como si fuesen escalares sería incorrecto y no representaría el Mundo real.
numeric_cols_weather = numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns
# Conversión a componentes vectoriales del viento
df_weather['wind_x'] = df_weather['wind_speed'] * np.cos(np.radians((90 - df_weather['wind_deg']) % 360))
df_weather['wind_y'] = df_weather['wind_speed'] * np.sin(np.radians((90 - df_weather['wind_deg']) % 360))
# Agregar por tiempo (promediando las componentes vectoriales)
agg_dict = {col: 'mean' for col in numeric_cols_weather}
agg_dict['wind_x'] = 'mean'
agg_dict['wind_y'] = 'mean'wind_x es la componente zonal del viento, que indica la velocidad del viento en dirección este-oeste. wind_y es la componente meridional del viento, que indica la velocidad del viento en dirección norte-sur.
La velocidad del viento no se eliminará ya que podría indicar patrones que no están del todo relacionados con la dirección del viento, como por ejemplo la intensidad de una tormenta.
df_weather.describe()| time | temp | temp_min | temp_max | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all | wind_x | wind_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 175271 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 | 175271.000000 |
| mean | 2016-12-31 14:55:28.938614784 | 16.559508 | 15.279950 | 18.024507 | 1016.215871 | 68.050339 | 8.889137 | 166.727097 | 0.068966 | 0.000386 | 0.004848 | 24.343052 | -1.377273 | 0.139367 |
| min | 2014-12-31 23:00:00 | -10.910000 | -10.910000 | -10.910000 | 918.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -192.508113 | -63.125445 |
| 25% | 2016-01-01 12:00:00 | 10.681000 | 9.639750 | 11.777594 | 1013.000000 | 53.000000 | 3.600000 | 56.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -5.515520 | -3.573166 |
| 50% | 2016-12-31 15:00:00 | 16.000000 | 15.000000 | 17.000000 | 1018.000000 | 72.000000 | 7.200000 | 178.000000 | 0.000000 | 0.000000 | 0.000000 | 16.000000 | 0.000000 | 0.000000 |
| 75% | 2017-12-31 19:00:00 | 22.090000 | 21.000000 | 24.000000 | 1022.000000 | 87.000000 | 14.400000 | 270.000000 | 0.000000 | 0.000000 | 0.000000 | 40.000000 | 2.757760 | 3.600000 |
| max | 2018-12-31 22:00:00 | 42.450000 | 42.000000 | 48.000000 | 1090.000000 | 100.000000 | 230.400000 | 360.000000 | 12.000000 | 2.315000 | 21.500000 | 100.000000 | 58.204659 | 160.049289 |
| std | NaN | 8.024963 | 7.948565 | 8.613761 | 12.499474 | 21.813633 | 7.457877 | 116.543511 | 0.380658 | 0.007349 | 0.224579 | 30.338188 | 8.675849 | 7.579610 |
numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns
agg_dict = {col: 'mean' for col in numeric_cols_weather}
df_weather_agg = df_weather.groupby('time').agg(agg_dict).reset_index()
df_weather_agg.drop(columns=['wind_deg'], inplace=True) # Columna no necesaria tras la conversión a componentes vectoriales
df_merged = pd.merge(df_energy, df_weather_agg, on='time', how='inner')
# Columnas que tienen un único valor constante
cols_to_drop = [col for col in df_merged.columns if df_merged[col].nunique() == 1]
df_merged = df_merged.drop(columns=cols_to_drop, errors='ignore')
# Columnas de energías renovables
renewable_cols = [
'generation biomass', 'generation hydro run-of-river and poundage',
'generation hydro water reservoir', 'generation other renewable',
'generation solar', 'generation waste', 'generation wind onshore'
]
df_merged['total_renewable_generation'] = df_merged[[col for col in renewable_cols if col in df_merged.columns]].sum(axis=1)
df_merged = df_merged.set_index('time')
print(f"Columnas eliminadas: {cols_to_drop}")
print(f"Dataset unido y limpio. Dimensiones: {df_merged.shape}")Columnas eliminadas: ['generation fossil coal-derived gas', 'generation fossil oil shale', 'generation fossil peat', 'generation geothermal', 'generation hydro pumped storage aggregated', 'generation marine', 'generation wind offshore', 'forecast wind offshore eday ahead'] Dataset unido y limpio. Dimensiones: (35046, 33)
observaciones: Confirmamos lo que sospechábamos de algunas conlumnas con datos continuos.
El dataset tiene predicciones del TSO de algunas variables claves. Las eliminaré.
tso_cols_to_drop = ['price day ahead', 'forecast wind onshore day ahead', 'forecast solar day ahead', 'total load forecast']
df_cleaned = df_merged.drop(columns=tso_cols_to_drop, errors='ignore')df_cleaned.info()<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 35046 entries, 2015-02-26 13:00:00 to 2018-12-14 11:00:00 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 generation biomass 35046 non-null float64 1 generation fossil brown coal or lignite 35046 non-null float64 2 generation fossil gas 35046 non-null float64 3 generation fossil hard coal 35046 non-null float64 4 generation fossil oil 35046 non-null float64 5 generation hydro pumped storage consumption 35046 non-null float64 6 generation hydro run-of-river and poundage 35046 non-null float64 7 generation hydro water reservoir 35046 non-null float64 8 generation nuclear 35046 non-null float64 9 generation other 35046 non-null float64 10 generation other renewable 35046 non-null float64 11 generation solar 35046 non-null float64 12 generation waste 35046 non-null float64 13 generation wind onshore 35046 non-null float64 14 total load actual 35046 non-null float64 15 price actual 35046 non-null float64 16 temp 35046 non-null float64 17 temp_min 35046 non-null float64 18 temp_max 35046 non-null float64 19 pressure 35046 non-null float64 20 humidity 35046 non-null float64 21 wind_speed 35046 non-null float64 22 rain_1h 35046 non-null float64 23 rain_3h 35046 non-null float64 24 snow_3h 35046 non-null float64 25 clouds_all 35046 non-null float64 26 wind_x 35046 non-null float64 27 wind_y 35046 non-null float64 28 total_renewable_generation 35046 non-null float64 dtypes: float64(29) memory usage: 8.0 MB
4. Análisis Exploratorio del Dataset Combinado
Distribución de Variables Numéricas (Univariante)
def plot_distributions(df, columns, n_cols=3, figsize=(15, None), bins=30):
# Calcular filas necesarias
n_rows = int(np.ceil(len(columns) / n_cols))
if figsize[1] is None:
figsize = (figsize[0], n_rows * 4)
# Crear subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
axes = np.array([axes]).flatten() if n_rows * n_cols == 1 else axes.flatten()
for i, col in enumerate(columns[:len(axes)]):
ax = axes[i]
# Histograma con KDE en azul
sns.histplot(df[col], bins=bins, kde=True, ax=ax, alpha=0.7, color=naturgy_blue_palette[0])
# Estadísticas básicas
mean_val = df[col].mean()
median_val = df[col].median()
# Líneas de media y mediana
ax.axvline(mean_val, color=naturgy_orange_palette[0], linewidth=2, label=f'Media: {mean_val:.2f}')
ax.axvline(median_val, color=naturgy_orange_palette[1], linewidth=2, linestyle='--',
label=f'Mediana: {median_val:.2f}')
# Títulos y etiquetas
ax.set_title(f'{col}')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
# Ocultar ejes no usados
for j in range(len(columns), len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
num_cols = df_cleaned.select_dtypes(include=np.number).columns
plot_distributions(df_cleaned, num_cols)
Observaciones: hay bastante variedad en las distribuciones:
- Distribuciones con cola a la derecha: una gran parte de las variables muestran este patrón. Podría indicar valores extremos que tendríamos que tratar.
- Distribuciones normales: algunas variables presentan patrones más estables como la temperatura o el precio.
- Distribuciones bimodales/multimodales: podrían indicar patrones en el sistema energético.
- Distribuciones casi en 0: podrían tener muy poca relevancia para predicciones.
Análisis de las Variables Objetivo (Targets)
targets_1 = ['total load actual', 'price actual', 'total_renewable_generation']
plot_distributions(df_cleaned, targets_1)

Observaciones
- total load actual: se aprecia claramente una distribución bimodal, podría ser por las diferenctes estaciones. Podría ser interesante hacer una variable categórica que capture los patrones de consumo.
- price actual: tiene una distribución asimétrica positiva con una ligera bimodalidad, podría estar relacionado con la distribución de la generación total. De todas formas tal vez se podría aplicar algún tipo de transformación, eliminar los valores extremos .
- total_renewable_generation: distribución asimétrica pero casi hace una campana de Gauss.
Análisis de Outliers
num_cols = df_cleaned.select_dtypes(include=np.number).columns
n_cols = 4
n_rows = int(np.ceil(len(num_cols) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, n_rows * 3))
axes = axes.flatten()
box_colors = {
'boxes': naturgy_blue_palette[0],
'whiskers': naturgy_blue_palette[1],
'medians': naturgy_orange_palette[0],
'caps': naturgy_blue_palette[2],
'fliers': naturgy_orange_palette[1]
}
for i, col in enumerate(num_cols):
if i < len(axes):
boxplot = axes[i].boxplot(df_cleaned[col], patch_artist=True)
for element, color in box_colors.items():
plt.setp(boxplot[element], color=color)
for patch in boxplot['boxes']:
patch.set(facecolor=naturgy_blue_palette[2], alpha=0.7)
axes[i].set_title(col, fontsize=10, color=naturgy_blue_palette[0], fontweight='bold')
axes[i].tick_params(axis='both', labelsize=8, colors=naturgy_blue_palette[1])
axes[i].set_xlabel('', fontsize=8)
axes[i].set_ylabel('Value', fontsize=8, color=naturgy_blue_palette[1])
axes[i].grid(True, linestyle='--', alpha=0.3)
axes[i].set_facecolor('#f8f9fa')
for j in range(i + 1, len(axes)):
axes[j].set_visible(False)
plt.suptitle('Detección de Outliers con Boxplots',
fontsize=16,
color=naturgy_blue_palette[0],
fontweight='bold')
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
Observaciones: A pesar de que se aprecian numerosos outlayers estos pueden guardan información bastante relevante, como fenómenos atmosféricos extremos. Podría ser interesante tratar solo aqullos que puedan sesgar el modelo.
Análisis de Correlaciones
def plot_correlation_matrix(df, figsize=(24, 20), cmap='coolwarm', annot=True, fontsize_annot=8, title = "Matriz de Correlación"):
fig, ax = plt.subplots(figsize=figsize)
sns.heatmap(
df,
annot=annot,
cmap=cmap,
center=0,
fmt='.2f',
annot_kws={"size": fontsize_annot},
cbar_kws={"shrink": 0.8}
)
plt.title(title, fontsize=16, color='#044474', fontweight='bold')
plt.xticks(rotation=90, fontsize=8)
plt.yticks(rotation=0, fontsize=8)
plt.tight_layout()
return fig, axplot_correlation_matrix(df_cleaned.select_dtypes(include=np.number).corr(), title="Matriz de Correlación de Variables Numéricas")
(<Figure size 2400x2000 with 2 Axes>,
<Axes: title={'center': 'Matriz de Correlación de Variables Numéricas'}>)
Observaciones : Es una tabla con bastantes dimensiones, no obstante probablemente se eliminen variables que no apartan nada y puede añadir ruido.
- Correlación entre fuentes fósiles: nos indica que suben y bajan de forma similar.
- Las condiciones meteorológicos parecen factores determinantes para calcular la genereación de las renovables.
- Las temperaturas entre ellas están demasiado correlacionadas (colinealidad).
- La generación nuclear no parece ser un predictor interesante.
Correlación entre Targets
Analizamos específicamente la relación entre nuestras variables objetivo. Esto es clave para un modelo multi-target, ya que una alta correlación puede indicar que los modelos pueden aprender patrones compartidos.
plot_correlation_matrix(df_cleaned[targets_1].corr(), figsize=(10, 8), title="Matriz de Correlación entre Variables Target")(<Figure size 1000x800 with 2 Axes>,
<Axes: title={'center': 'Matriz de Correlación entre Variables Target'}>)
Series Temporales y Estacionalidad
Creo que sería interesante añadir columnas basadas en la fecha y hora. Por la naturaleza cíclica de los datos. De esta forma el modelo podría beneficiarse de patrones que pueda proporcionar las fechas.
5 Limpieza, Procesamiento y Feature Engineering
Tratamiento de outlayers (Extremos)
columns_to_cap = [
'generation fossil gas', 'generation fossil oil',
'generation hydro pumped storage consumption',
'generation biomass', 'pressure', 'wind_speed', 'wind_x', 'wind_y',
]
df_processed = df_cleaned.copy()
for col in columns_to_cap:
Q1 = df_processed[col].quantile(0.25)
Q3 = df_processed[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_processed[col] = np.clip(df_processed[col], lower, upper)Tranformaciones
df_transformed = df_processed.copy()
cols_to_log1p = [
'generation hydro pumped storage consumption',
'rain_1h',
]
cols_to_log = [
'generation fossil gas',
'generation hydro water reservoir',
]
for col in cols_to_log1p:
if col in df_transformed.columns:
df_transformed[col] = np.log1p(df_transformed[col])
df_transformed.rename(columns={col: f'{col}_log'}, inplace=True)
print(f"Columna '{col}' transformada con log1p.")
epsilon = 1e-9
for col in cols_to_log:
if col in df_transformed.columns:
df_transformed[col] = np.log(df_transformed[col] + epsilon)
df_transformed.rename(columns={col: f'{col}_log'}, inplace=True)
print(f"Columna '{col}' transformada con log.")
plot_distributions(df_transformed, [f'{c}_log' for c in cols_to_log1p + cols_to_log])Columna 'generation hydro pumped storage consumption' transformada con log1p. Columna 'rain_1h' transformada con log1p. Columna 'generation fossil gas' transformada con log. Columna 'generation hydro water reservoir' transformada con log.

Las gráficas ya muestran una distribución más simétricas en el sentido que media = mediana (aprox)
Con clauds_all provoca un sesgo hacia la izquierda bastante grande y no mejora la distribución, no aplicaré log.
Análisis del Factor de Inflación de la Varianza (VIF)
El VIF es una medida numérica para la multicolinealidad. Un valor de VIF > 10 es una señal fuerte de que una variable es redundante y podría ser eliminada. Esto ayuda a crear modelos más simples y estables.
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(df_numeric):
vif_data = pd.DataFrame()
vif_data['feature'] = df_numeric.columns
vif_data['VIF'] = [variance_inflation_factor(df_numeric.values, i) for i in range(len(df_numeric.columns))]
return vif_data.sort_values(by='VIF', ascending=False)
numeric_df = df_transformed.select_dtypes(include=np.number).dropna()
vif_results = calculate_vif(numeric_df)
print("Resultados del VIF (Factor de Inflación de la Varianza):")
display(vif_results.head(20))Resultados del VIF (Factor de Inflación de la Varianza):
| feature | VIF | |
|---|---|---|
| 16 | temp | 7714.952660 |
| 19 | pressure | 2887.828290 |
| 2 | generation fossil gas_log | 2784.263753 |
| 18 | temp_max | 2056.436473 |
| 17 | temp_min | 1988.810284 |
| 7 | generation hydro water reservoir_log | 413.393396 |
| 14 | total load actual | 394.255558 |
| 28 | total_renewable_generation | 307.434760 |
| 10 | generation other renewable | 111.016859 |
| 20 | humidity | 76.294145 |
| 8 | generation nuclear | 72.803474 |
| 13 | generation wind onshore | 69.761796 |
| 12 | generation waste | 66.919433 |
| 4 | generation fossil oil | 65.248394 |
| 0 | generation biomass | 64.965086 |
| 15 | price actual | 31.559148 |
| 3 | generation fossil hard coal | 28.369906 |
| 6 | generation hydro run-of-river and poundage | 28.091222 |
| 9 | generation other | 19.192147 |
| 11 | generation solar | 10.053490 |
Para producción se podría plantear alguna lógica a la hora de elimnar estas columnas
# Se incluen algunas columnas con vif > 10, las de generación de energías renovables ya que tienen una correlación demasiado alta ya con la total.
cols_to_drop_vif = [
'generation biomass', 'generation hydro run-of-river and poundage',
'generation hydro water reservoir', 'generation other renewable',
'generation solar', 'generation waste', 'generation wind onshore',
'temp_min', 'temp_max',
]
df_vif_cleaned = df_transformed.drop(columns=cols_to_drop_vif, errors='ignore')
print(f"Dimensiones tras eliminar variables con alto VIF: {df_vif_cleaned.shape}")
df_vif_cleaned.info()Dimensiones tras eliminar variables con alto VIF: (35046, 21) <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 35046 entries, 2015-02-26 13:00:00 to 2018-12-14 11:00:00 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 generation fossil brown coal or lignite 35046 non-null float64 1 generation fossil gas_log 35046 non-null float64 2 generation fossil hard coal 35046 non-null float64 3 generation fossil oil 35046 non-null float64 4 generation hydro pumped storage consumption_log 35046 non-null float64 5 generation hydro water reservoir_log 35046 non-null float64 6 generation nuclear 35046 non-null float64 7 generation other 35046 non-null float64 8 total load actual 35046 non-null float64 9 price actual 35046 non-null float64 10 temp 35046 non-null float64 11 pressure 35046 non-null float64 12 humidity 35046 non-null float64 13 wind_speed 35046 non-null float64 14 rain_1h_log 35046 non-null float64 15 rain_3h 35046 non-null float64 16 snow_3h 35046 non-null float64 17 clouds_all 35046 non-null float64 18 wind_x 35046 non-null float64 19 wind_y 35046 non-null float64 20 total_renewable_generation 35046 non-null float64 dtypes: float64(21) memory usage: 5.9 MB
6 Featuring Engineering avanzado
Creamos características temporales y los tranformamos en componentes cíclicas de seno y coseno para capturar la naturaleza cíclica de los datos temporales.
df_vif_cleaned['hour'] = df_vif_cleaned.index.hour
df_vif_cleaned['dayofweek'] = df_vif_cleaned.index.dayofweek
df_vif_cleaned['month'] = df_vif_cleaned.index.month
df_vif_cleaned['year'] = df_vif_cleaned.index.year
df_vif_cleaned['dayofyear'] = df_vif_cleaned.index.dayofyear
# Features cíclicas con sen y cos
# Con esto resolvemos el problema del reloj, el reloj no es una linea recta sino que es un círculo.
df_vif_cleaned['hour_sin'] = np.sin(2 * np.pi * df_vif_cleaned['hour']/24.0)
df_vif_cleaned['hour_cos'] = np.cos(2 * np.pi * df_vif_cleaned['hour']/24.0)
# Para el día de la semana (ciclo de 7 días)
df_vif_cleaned['dayofweek_sin'] = np.sin(2 * np.pi * df_vif_cleaned['dayofweek']/7.0)
df_vif_cleaned['dayofweek_cos'] = np.cos(2 * np.pi * df_vif_cleaned['dayofweek']/7.0)
# Para el mes del año (ciclo de 12 meses)
df_vif_cleaned['month_sin'] = np.sin(2 * np.pi * df_vif_cleaned['month']/12.0)
df_vif_cleaned['month_cos'] = np.cos(2 * np.pi * df_vif_cleaned['month']/12.0)
# Para el día del año (ciclo de 365 días)
df_vif_cleaned['dayofyear_sin'] = np.sin(2 * np.pi * df_vif_cleaned['dayofyear']/365.0)
df_vif_cleaned['dayofyear_cos'] = np.cos(2 * np.pi * df_vif_cleaned['dayofyear']/365.0)
# Eliminar las columnas temporales lineales
cols_to_drop_cyclical = ['hour', 'dayofweek', 'month', 'year', 'dayofyear']
df_vif_cleaned = df_vif_cleaned.drop(columns=cols_to_drop_cyclical, errors='ignore')
print(f"Dimensiones tras añadir variables cíclicas: {df_vif_cleaned.shape}")Dimensiones tras añadir variables cíclicas: (35046, 29)
Para modelos de predicciones en el futuro, lo ideal es que el conjunto de datos para los modelos aporten preguntas tan simples como: ¿Qué fue lo último que paso hace una?, ¿Cómo era la tendencia hace 24h?, etc. Siguiendo esta lógico podemos aprovecharnos de dos estrategias fundamentales, de esta forma capturaremos información bastante interesante de estas series temporales.
Ordenación de los dato en orden cronológico
Para Poder validar los resultados.
print(df_vif_cleaned.index[:5])DatetimeIndex(['2015-02-26 13:00:00', '2015-04-30 07:00:00',
'2015-06-13 20:00:00', '2015-06-23 19:00:00',
'2015-06-30 23:00:00'],
dtype='datetime64[ns]', name='time', freq=None)
df_vif_cleaned = df_vif_cleaned.sort_index()
print(df_vif_cleaned.index[:5])DatetimeIndex(['2014-12-31 23:00:00', '2015-01-01 00:00:00',
'2015-01-01 01:00:00', '2015-01-01 02:00:00',
'2015-01-01 03:00:00'],
dtype='datetime64[ns]', name='time', freq=None)
Lags
Es el valor de la misma variable en un período anterior (t) Me recuerda un poco al lag de los videojuegos.
dimensions_before = df_vif_cleaned.shape
def add_lags(df, col, lags, prefix=None):
if prefix is None:
prefix = col
for lag in lags:
df[f'{prefix}_lag_{lag}h'] = df[col].shift(lag)
return df
# Lags para nuestros targets
lags_targets = [1, 3, 6, 12, 24, 48, 168, 336]
# Los lags de la generación de renovables ya se generan en las de la generacion de energía
df_vif_cleaned = add_lags(df_vif_cleaned, 'total load actual', lags_targets, prefix='load')
df_vif_cleaned = add_lags(df_vif_cleaned, 'price actual', lags_targets, prefix='price')
# Para las variables de clima
lags_weather = [1, 3, 6, 12, 24, 48, 168, 336]
for col in df_vif_cleaned.columns:
if 'temp' in col or 'rain' in col or 'wind' in col or 'pressure' in col or 'clouds' in col:
df_vif_cleaned = add_lags(df_vif_cleaned, col, lags_weather, prefix=col)
# Para la generación de energía
lags_generation = [1, 3, 6, 12, 24, 48, 168, 336]
for col in df_vif_cleaned.columns:
if 'generation' in col:
df_vif_cleaned = add_lags(df_vif_cleaned, col, lags_generation, prefix=col)7 Pipeline de Procesamiento de datos
def preprocess_data_pipeline(energy_parquet_path: str,
weather_parquet_path: str,
is_training: bool = True) -> pd.DataFrame:
"""
Función completa para preprocesar los datos de energía y clima
Args:
energy_parquet_path (str): Ruta al archivo parquet de datos de energía.
weather_parquet_path (str): Ruta al archivo parquet de datos de clima.
is_training (bool): True si los datos se procesan para entrenamiento
(se crearán targets futuros).
False para predicción (no se crean targets futuros).
Returns:
pd.DataFrame: DataFrame procesado y listo para el siguiente paso
(separación X/y o predicción).
"""
print("Iniciando pipeline de preprocesamiento de datos...")
# --- 1. Carga de Datasets ---
print(f"Cargando datos de energía desde: {energy_parquet_path}")
df_energy = pd.read_parquet(energy_parquet_path)
print(f"Cargando datos de clima desde: {weather_parquet_path}")
df_weather = pd.read_parquet(weather_parquet_path)
# --- 2. Preprocesamiento de Datos de Clima y Unión ---
print("Preprocesando datos de clima y uniendo datasets...")
if 'wind_speed' in df_weather.columns and 'wind_deg' in df_weather.columns:
df_weather['wind_x'] = df_weather['wind_speed'] * np.cos(np.radians((90 - df_weather['wind_deg']) % 360))
df_weather['wind_y'] = df_weather['wind_speed'] * np.sin(np.radians((90 - df_weather['wind_deg']) % 360))
else:
print("Advertencia: 'wind_speed' o 'wind_deg' no encontradas en datos de clima. Omitiendo wind_x/wind_y.")
numeric_cols_weather = df_weather.select_dtypes(include=np.number).columns.tolist()
agg_dict = {col: 'mean' for col in numeric_cols_weather if col in df_weather.columns}
if 'wind_x' in agg_dict: agg_dict['wind_x'] = 'mean' # Asegurar que se usa si existe
if 'wind_y' in agg_dict: agg_dict['wind_y'] = 'mean'
if 'time' not in df_weather.columns:
raise ValueError("Columna 'time' no encontrada en df_weather, necesaria para la agregación.")
df_weather_agg = df_weather.groupby('time').agg(agg_dict).reset_index()
df_merged = pd.merge(df_energy, df_weather_agg, on='time', how='inner')
cols_to_drop_constant = [col for col in df_merged.columns if df_merged[col].nunique(dropna=False) == 1]
if cols_to_drop_constant:
print(f"Eliminando columnas constantes: {cols_to_drop_constant}")
df_merged = df_merged.drop(columns=cols_to_drop_constant, errors='ignore')
renewable_cols_list = [
'generation biomass', 'generation hydro run-of-river and poundage',
'generation hydro water reservoir', 'generation other renewable',
'generation solar', 'generation waste', 'generation wind onshore'
]
existing_renewable_cols = [col for col in renewable_cols_list if col in df_merged.columns]
if existing_renewable_cols:
df_merged['total_renewable_generation'] = df_merged[existing_renewable_cols].sum(axis=1)
else:
print("Advertencia: Columnas de generación renovable no encontradas. 'total_renewable_generation' será 0.")
df_merged['total_renewable_generation'] = 0
if 'time' not in df_merged.columns:
raise ValueError("Columna 'time' no encontrada en df_merged, necesaria para el índice.")
df_merged = df_merged.set_index('time')
df_merged.index = pd.to_datetime(df_merged.index)
tso_cols_to_drop_list = ['price day ahead', 'forecast wind onshore day ahead', 'forecast solar day ahead', 'total load forecast']
existing_tso_cols = [col for col in tso_cols_to_drop_list if col in df_merged.columns]
if existing_tso_cols:
print(f"Eliminando columnas TSO: {existing_tso_cols}")
df_processed = df_merged.drop(columns=existing_tso_cols, errors='ignore')
else:
df_processed = df_merged.copy()
# --- 3. Limpieza, Procesamiento y Feature Engineering Adicional ---
print("Aplicando tratamiento de outliers y transformaciones...")
columns_to_cap_list = [
'generation fossil gas', 'generation fossil oil',
'generation biomass', 'pressure', 'wind_speed', 'wind_x', 'wind_y',
]
for col in columns_to_cap_list:
if col in df_processed.columns:
Q1 = df_processed[col].quantile(0.25); Q3 = df_processed[col].quantile(0.75)
IQR = Q3 - Q1
df_processed[col] = np.clip(df_processed[col], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
else: print(f"Advertencia: Columna para capping '{col}' no encontrada.")
cols_to_log1p_list = ['generation fossil oil', 'generation hydro pumped storage consumption']
cols_to_log_list = ['generation fossil gas', 'generation hydro water reservoir', 'wind_speed']
epsilon = 1e-9
for col in cols_to_log1p_list:
if col in df_processed.columns:
df_processed[f'{col}_log1p'] = np.log1p(df_processed[col])
df_processed.drop(columns=[col], inplace=True, errors='ignore')
else: print(f"Advertencia: Columna para log1p '{col}' no encontrada.")
for col in cols_to_log_list:
if col in df_processed.columns:
df_processed[f'{col}_log'] = np.log(df_processed[col] + epsilon)
df_processed.drop(columns=[col], inplace=True, errors='ignore')
else: print(f"Advertencia: Columna para log '{col}' no encontrada.")
print("Creando características temporales...")
df_processed['hour_val'] = df_processed.index.hour
df_processed['dayofweek_val'] = df_processed.index.dayofweek
df_processed['month_val'] = df_processed.index.month
df_processed['year_val'] = df_processed.index.year
df_processed['dayofyear_val'] = df_processed.index.dayofyear
df_processed['hour_sin'] = np.sin(2 * np.pi * df_processed['hour_val'] / 24.0)
df_processed['hour_cos'] = np.cos(2 * np.pi * df_processed['hour_val'] / 24.0)
df_processed['dayofweek_sin'] = np.sin(2 * np.pi * df_processed['dayofweek_val'] / 7.0)
df_processed['dayofweek_cos'] = np.cos(2 * np.pi * df_processed['dayofweek_val'] / 7.0)
df_processed['month_sin'] = np.sin(2 * np.pi * df_processed['month_val'] / 12.0)
df_processed['month_cos'] = np.cos(2 * np.pi * df_processed['month_val'] / 12.0)
df_processed['dayofyear_sin'] = np.sin(2 * np.pi * df_processed['dayofyear_val'] / 365.0)
df_processed['dayofyear_cos'] = np.cos(2 * np.pi * df_processed['dayofyear_val'] / 365.0)
cols_to_drop_vif_list = [
'generation biomass', 'generation hydro run-of-river and poundage',
'generation hydro water reservoir',
'generation hydro water reservoir_log',
'generation other renewable', 'generation solar', 'generation waste',
'generation wind onshore', 'temp_min', 'temp_max',
'year_val', 'month_val', 'dayofyear_val', 'hour_val', 'dayofweek_val'
]
actual_cols_to_drop_vif = [col for col in cols_to_drop_vif_list if col in df_processed.columns]
if actual_cols_to_drop_vif:
print(f"Eliminando columnas por VIF (predefinidas): {actual_cols_to_drop_vif}")
df_vif_cleaned = df_processed.drop(columns=actual_cols_to_drop_vif, errors='ignore')
else:
df_vif_cleaned = df_processed.copy()
# --- 4. Feature Engineering Avanzado (Lags) ---
print("Aplicando feature engineering avanzado (lags)...")
df_vif_cleaned = df_vif_cleaned.sort_index()
lag_periods = [1, 3, 6, 12, 24, 48, 168, 336]
cols_for_target_lags = {
'load': 'total load actual',
'price': 'price actual',
'renewable_gen': 'total_renewable_generation'
}
for prefix, source_col in cols_for_target_lags.items():
if source_col in df_vif_cleaned.columns:
for lag in lag_periods:
df_vif_cleaned[f'{prefix}_lag_{lag}h'] = df_vif_cleaned[source_col].shift(lag)
else: print(f"Advertencia: Columna fuente para lags de target '{source_col}' no encontrada.")
cols_to_exclude_from_other_lags = list(cols_for_target_lags.values()) + \
[col for col in df_vif_cleaned.columns if '_lag_' in col or '_sin' in col or '_cos' in col]
other_cols_for_lags = [col for col in df_vif_cleaned.columns if col not in cols_to_exclude_from_other_lags and df_vif_cleaned[col].dtype in [np.float64, np.int64]]
for col_name in other_cols_for_lags:
safe_prefix = col_name.replace(' ', '_').replace('-', '_').lower()
for lag in lag_periods:
df_vif_cleaned[f'{safe_prefix}_lag_{lag}h'] = df_vif_cleaned[col_name].shift(lag)
rolling_lag_windows_list = [1, 3, 6, 12, 24, 48]
rolling_base_cols_map = {
'load_mean': 'total load actual',
'price_mean': 'price actual',
'renewable_mean': 'total_renewable_generation'
}
for prefix_key, original_col in rolling_base_cols_map.items():
if original_col in df_vif_cleaned.columns:
for lag_window in rolling_lag_windows_list:
df_vif_cleaned[f'{prefix_key}_{lag_window}h_lag_{lag_window}h'] = df_vif_cleaned[original_col].shift(lag_window)
else: print(f"Advertencia: Columna base para 'rolling lags' '{original_col}' no encontrada.")
print("Eliminando NaNs introducidos por lags...")
df_vif_cleaned = df_vif_cleaned.dropna()
# --- 5. Creación de Variables Objetivo Futuras ---
final_df = df_vif_cleaned.copy()
if is_training:
print("Creando variables objetivo futuras...")
targets_to_create_map = {
'renewable_H_plus_6': ('total_renewable_generation', -6),
'load_H_plus_6': ('total load actual', -6),
'price_H_plus_6': ('price actual', -6),
'renewable_H_plus_1': ('total_renewable_generation', -1),
'load_H_plus_1': ('total load actual', -1),
'price_H_plus_1': ('price actual', -1)
}
for new_target_col, (source_col, shift_val) in targets_to_create_map.items():
if source_col in final_df.columns: # Verificar si la columna fuente existe
final_df[new_target_col] = final_df[source_col].shift(shift_val)
else:
print(f"Advertencia: Columna fuente '{source_col}' para crear target '{new_target_col}' no encontrada. Target no se creará.")
print("Eliminando NaNs introducidos por el shift de targets...")
final_df = final_df.dropna()
else:
print("Modo predicción: No se crearán variables objetivo futuras.")
print(f"Pipeline de preprocesamiento completado. Dimensiones finales del DataFrame: {final_df.shape}")
return final_df
8 Preparación y Entrenamiento de modelos
# Ruta para guardar los modelos
models_path = '../Modelo_Entrenado/'En este apartado experimentaré principalment con dos modelos:
XGBoost: me pareció una opción interesante por su eficiencia en el entrenamiento, probé con random forest y tardaba más con resultados similares. Su opción de MultiOutputRegressor simplifica mucho el entrenamiento ideal para nuetro caso que tenemos varios targets. Además con una configuración sencilla de buenos resultados.
Red Neuronal: interesante para nuestras variables, especialmente para predecir el precio, que en nustro contexto es una de las variables más difíciles de predecir debido a su volatilidad, no tienen un patrón cíclico tan marcado como la generación o demanda.
Crear las variables objetivo futuras
df_vif_cleaned['renewable_H_plus_6'] = df_vif_cleaned['total_renewable_generation'].shift(-6)
df_vif_cleaned['load_H_plus_6'] = df_vif_cleaned['total load actual'].shift(-6)
df_vif_cleaned['renewable_H_plus_1'] = df_vif_cleaned['total_renewable_generation'].shift(-1)
df_vif_cleaned['load_H_plus_1'] = df_vif_cleaned['total load actual'].shift(-1)
df_vif_cleaned['price_H_plus_1'] = df_vif_cleaned['price actual'].shift(-1)
df_vif_cleaned = df_vif_cleaned.dropna() Division de datos entrenamiento validacion y prueba final
Dejaremos un mes de datos totalmente aislados para la prueba final de los modelos.
cutoff_date = pd.to_datetime('2018-12-10')
df_dev = df_vif_cleaned[df_vif_cleaned.index < cutoff_date]
targets = ['renewable_H_plus_6', 'load_H_plus_6',
'renewable_H_plus_1', 'load_H_plus_1', 'price_H_plus_1']
#Datos para la prueba final.
df_final_test = df_vif_cleaned[df_vif_cleaned.index >= cutoff_date]
X_final_test = df_final_test.drop(columns=targets, errors='ignore')
y_final_test = df_final_test[targets]
# Datos para el entrenamiento
X_dev_features = df_dev.drop(columns=targets, errors='ignore')
y_dev_targets = df_dev[targets]
# Split de datos para el entrenamiento
X_train, X_val, y_train, y_val = train_test_split(
X_dev_features, y_dev_targets, test_size=0.2, random_state=42, shuffle=False
)Preprocesamiento y Escalado
numeric_features = X_train.select_dtypes(include=np.number).columns.tolist()
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features)
]
)
X_train_scaled = preprocessor.fit_transform(X_train)
X_val_scaled = preprocessor.transform(X_val)XGBoost
Configuración mínima. Tiene por defecto tiene regularización L1 y L2 Con esta configuración da muy buenos resultados sin afectar a la velocidad de entrenamiento. Hice alguna prueba con grid search pero los resultados no mejoran, entonces preferí ponerlos a mano.
Entrenamiento y Guardado de XGBoost
xgb_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('regressor', MultiOutputRegressor(xgb.XGBRegressor(objective='reg:squarederror', random_state=42, n_estimators=1700,
max_depth=8, subsample=0.8,
learning_rate=0.005, tree_method='gpu_hist', gpu_id=0
)))
])
print("Entrenando el modelo XGBoost...")
start_time_xgb = time.time()
xgb_pipeline.fit(X_train, y_train)
end_time_xgb = time.time()
training_time_xgb = end_time_xgb - start_time_xgb
print("Entrenamiento completado.")
print(f"Tiempo de entrenamiento de XGBoost: {training_time_xgb:.2f} segundos")
Entrenando el modelo XGBoost... Entrenamiento completado. Tiempo de entrenamiento de XGBoost: 177.59 segundos
model_filename_xgb = models_path + '/xgb_model-energy_weather.joblib'
joblib.dump(xgb_pipeline, model_filename_xgb)
print(f"Modelo XGBoost guardado en: {model_filename_xgb}")Modelo XGBoost guardado en: ../Modelo_Entrenado//xgb_model-energy_weather.joblib
Red Neuronal
La red neural que he diseñado es bastante sencilla
- Estructura: 1 capa de entrada, 2 ocultas (256, 128), y 1 capa de salida.
- Hiperparámetros
- BatchNorm: después de cada capa densa para estabilizar y acelerar el entrenamiento.
- Dropout(0.2): para prevenir sobreajuste. Está configurado al 20% para regularizar fuertemente el modelo.
- Optimizador: Adam. Es el que mejor funciona generalmente.
- Función de activación: ReLU en las capas ocultas, lineal en la capa de salida para regresión.
- Función de pérdida: (
L1Loss) pérdida del error medio cuadrático. - Learning Rate: Comenzamos con un valor de 0.00005 y utilizamos un scheduler para reducir la tasa de aprendizaje cuando la pérdida de validación se estanca.
- Early Stopping: Detenemos el entrenamiento si la pérdida de validación no mejora durante 40 épocas.
Preparación de Datos para Red Neuronal
X_train_nn = X_train_scaled
X_val_nn = X_val_scaled
y_scaler = StandardScaler()
y_train_nn_scaled = y_scaler.fit_transform(y_train)
y_val_nn_scaled = y_scaler.transform(y_val)
train_loader = DataLoader(
TensorDataset(torch.tensor(X_train_nn, dtype=torch.float32), torch.tensor(y_train_nn_scaled, dtype=torch.float32)),
batch_size=64, shuffle=True
)
val_loader = DataLoader(
TensorDataset(torch.tensor(X_val_nn, dtype=torch.float32), torch.tensor(y_val_nn_scaled, dtype=torch.float32)),
batch_size=64, shuffle=False
)Entrenamiento y Guardado de Red Neuronal
# Modelo
class MLP(nn.Module):
def __init__(self, input_size, output_size):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, output_size)
)
def forward(self, x): return self.layers(x)
# Parámetros del modelo
epochs = 250
patience = 40
min_delta = 0.001
best_val_loss = float('inf')
patience_counter = 0
model = MLP(X_train_nn.shape[1], y_train.shape[1])
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.00001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=20)
train_losses, test_losses = [], []
print("Entrenando el modelo de Red Neuronal (PyTorch) con Early Stopping...")
start_time_nn = time.time()
for epoch in range(epochs):
# Entrenamiento
model.train()
train_loss = 0
for Xb, yb in train_loader:
optimizer.zero_grad()
loss = criterion(model(Xb), yb)
loss.backward()
optimizer.step()
train_loss += loss.item()
train_losses.append(train_loss / len(train_loader))
# Validación
model.eval()
with torch.no_grad():
test_loss = sum(criterion(model(Xb), yb).item() for Xb, yb in val_loader) / len(val_loader)
test_losses.append(test_loss)
# Learning rate scheduler
scheduler.step(test_loss)
# Early stopping
if test_loss < best_val_loss - min_delta:
best_val_loss = test_loss
patience_counter = 0
# Guardar el mejor modelo
torch.save(model.state_dict(), 'best_nn_model.pth')
else:
patience_counter += 1
print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}, LR: {optimizer.param_groups[0]["lr"]:.6f}')
if patience_counter >= patience:
print(f'Early stopping en época {epoch+1}')
break
# Cargar el mejor modelo
model.load_state_dict(torch.load('best_nn_model.pth'))
end_time_nn = time.time()
training_time_nn = end_time_nn - start_time_nn
print("Entrenamiento de la Red Neuronal completado.")
print(f"Tiempo de entrenamiento de la Red Neuronal: {training_time_nn:.2f} segundos")
print(f"Mejor pérdida de validación: {best_val_loss:.4f}")
Entrenando el modelo de Red Neuronal (PyTorch) con Early Stopping... Epoch [1/250], Train Loss: 0.7543, Test Loss: 0.6501, LR: 0.000010 Epoch [2/250], Train Loss: 0.6144, Test Loss: 0.5692, LR: 0.000010 Epoch [3/250], Train Loss: 0.5488, Test Loss: 0.5051, LR: 0.000010 Epoch [4/250], Train Loss: 0.5083, Test Loss: 0.4794, LR: 0.000010 Epoch [5/250], Train Loss: 0.4792, Test Loss: 0.4436, LR: 0.000010 Epoch [6/250], Train Loss: 0.4587, Test Loss: 0.4114, LR: 0.000010 Epoch [7/250], Train Loss: 0.4418, Test Loss: 0.4014, LR: 0.000010 Epoch [8/250], Train Loss: 0.4291, Test Loss: 0.4041, LR: 0.000010 Epoch [9/250], Train Loss: 0.4189, Test Loss: 0.3831, LR: 0.000010 Epoch [10/250], Train Loss: 0.4123, Test Loss: 0.3628, LR: 0.000010 Epoch [11/250], Train Loss: 0.4032, Test Loss: 0.3633, LR: 0.000010 Epoch [12/250], Train Loss: 0.3956, Test Loss: 0.3519, LR: 0.000010 Epoch [13/250], Train Loss: 0.3904, Test Loss: 0.3480, LR: 0.000010 Epoch [14/250], Train Loss: 0.3869, Test Loss: 0.3416, LR: 0.000010 Epoch [15/250], Train Loss: 0.3815, Test Loss: 0.3338, LR: 0.000010 Epoch [16/250], Train Loss: 0.3770, Test Loss: 0.3372, LR: 0.000010 Epoch [17/250], Train Loss: 0.3710, Test Loss: 0.3271, LR: 0.000010 Epoch [18/250], Train Loss: 0.3713, Test Loss: 0.3240, LR: 0.000010 Epoch [19/250], Train Loss: 0.3662, Test Loss: 0.3272, LR: 0.000010 Epoch [20/250], Train Loss: 0.3621, Test Loss: 0.3229, LR: 0.000010 Epoch [21/250], Train Loss: 0.3595, Test Loss: 0.3234, LR: 0.000010 Epoch [22/250], Train Loss: 0.3573, Test Loss: 0.3115, LR: 0.000010 Epoch [23/250], Train Loss: 0.3535, Test Loss: 0.3133, LR: 0.000010 Epoch [24/250], Train Loss: 0.3504, Test Loss: 0.3124, LR: 0.000010 Epoch [25/250], Train Loss: 0.3487, Test Loss: 0.3112, LR: 0.000010 Epoch [26/250], Train Loss: 0.3453, Test Loss: 0.3144, LR: 0.000010 Epoch [27/250], Train Loss: 0.3430, Test Loss: 0.3018, LR: 0.000010 Epoch [28/250], Train Loss: 0.3418, Test Loss: 0.3078, LR: 0.000010 Epoch [29/250], Train Loss: 0.3398, Test Loss: 0.2985, LR: 0.000010 Epoch [30/250], Train Loss: 0.3379, Test Loss: 0.3023, LR: 0.000010 Epoch [31/250], Train Loss: 0.3356, Test Loss: 0.2980, LR: 0.000010 Epoch [32/250], Train Loss: 0.3335, Test Loss: 0.2972, LR: 0.000010 Epoch [33/250], Train Loss: 0.3336, Test Loss: 0.2962, LR: 0.000010 Epoch [34/250], Train Loss: 0.3315, Test Loss: 0.3312, LR: 0.000010 Epoch [35/250], Train Loss: 0.3290, Test Loss: 0.2987, LR: 0.000010 Epoch [36/250], Train Loss: 0.3284, Test Loss: 0.2952, LR: 0.000010 Epoch [37/250], Train Loss: 0.3282, Test Loss: 0.2838, LR: 0.000010 Epoch [38/250], Train Loss: 0.3237, Test Loss: 0.2873, LR: 0.000010 Epoch [39/250], Train Loss: 0.3226, Test Loss: 0.2870, LR: 0.000010 Epoch [40/250], Train Loss: 0.3216, Test Loss: 0.2902, LR: 0.000010 Epoch [41/250], Train Loss: 0.3196, Test Loss: 0.2892, LR: 0.000010 Epoch [42/250], Train Loss: 0.3187, Test Loss: 0.2918, LR: 0.000010 Epoch [43/250], Train Loss: 0.3169, Test Loss: 0.2775, LR: 0.000010 Epoch [44/250], Train Loss: 0.3156, Test Loss: 0.2777, LR: 0.000010 Epoch [45/250], Train Loss: 0.3147, Test Loss: 0.2771, LR: 0.000010 Epoch [46/250], Train Loss: 0.3136, Test Loss: 0.2816, LR: 0.000010 Epoch [47/250], Train Loss: 0.3120, Test Loss: 0.2780, LR: 0.000010 Epoch [48/250], Train Loss: 0.3126, Test Loss: 0.2707, LR: 0.000010 Epoch [49/250], Train Loss: 0.3093, Test Loss: 0.2822, LR: 0.000010 Epoch [50/250], Train Loss: 0.3107, Test Loss: 0.2698, LR: 0.000010 Epoch [51/250], Train Loss: 0.3087, Test Loss: 0.2697, LR: 0.000010 Epoch [52/250], Train Loss: 0.3072, Test Loss: 0.2814, LR: 0.000010 Epoch [53/250], Train Loss: 0.3055, Test Loss: 0.2764, LR: 0.000010 Epoch [54/250], Train Loss: 0.3059, Test Loss: 0.2757, LR: 0.000010 Epoch [55/250], Train Loss: 0.3034, Test Loss: 0.2769, LR: 0.000010 Epoch [56/250], Train Loss: 0.3034, Test Loss: 0.2735, LR: 0.000010 Epoch [57/250], Train Loss: 0.3025, Test Loss: 0.2757, LR: 0.000010 Epoch [58/250], Train Loss: 0.3016, Test Loss: 0.2667, LR: 0.000010 Epoch [59/250], Train Loss: 0.3015, Test Loss: 0.2683, LR: 0.000010 Epoch [60/250], Train Loss: 0.2990, Test Loss: 0.2630, LR: 0.000010 Epoch [61/250], Train Loss: 0.2982, Test Loss: 0.2630, LR: 0.000010 Epoch [62/250], Train Loss: 0.2974, Test Loss: 0.2713, LR: 0.000010 Epoch [63/250], Train Loss: 0.2971, Test Loss: 0.2684, LR: 0.000010 Epoch [64/250], Train Loss: 0.2975, Test Loss: 0.2673, LR: 0.000010 Epoch [65/250], Train Loss: 0.2970, Test Loss: 0.2638, LR: 0.000010 Epoch [66/250], Train Loss: 0.2944, Test Loss: 0.2694, LR: 0.000010 Epoch [67/250], Train Loss: 0.2938, Test Loss: 0.2571, LR: 0.000010 Epoch [68/250], Train Loss: 0.2941, Test Loss: 0.2632, LR: 0.000010 Epoch [69/250], Train Loss: 0.2926, Test Loss: 0.2618, LR: 0.000010 Epoch [70/250], Train Loss: 0.2922, Test Loss: 0.2614, LR: 0.000010 Epoch [71/250], Train Loss: 0.2927, Test Loss: 0.2647, LR: 0.000010 Epoch [72/250], Train Loss: 0.2921, Test Loss: 0.2598, LR: 0.000010 Epoch [73/250], Train Loss: 0.2903, Test Loss: 0.2660, LR: 0.000010 Epoch [74/250], Train Loss: 0.2907, Test Loss: 0.2681, LR: 0.000010 Epoch [75/250], Train Loss: 0.2903, Test Loss: 0.2581, LR: 0.000010 Epoch [76/250], Train Loss: 0.2913, Test Loss: 0.2607, LR: 0.000010 Epoch [77/250], Train Loss: 0.2881, Test Loss: 0.2557, LR: 0.000010 Epoch [78/250], Train Loss: 0.2869, Test Loss: 0.2620, LR: 0.000010 Epoch [79/250], Train Loss: 0.2875, Test Loss: 0.2687, LR: 0.000010 Epoch [80/250], Train Loss: 0.2877, Test Loss: 0.2613, LR: 0.000010 Epoch [81/250], Train Loss: 0.2877, Test Loss: 0.2578, LR: 0.000010 Epoch [82/250], Train Loss: 0.2853, Test Loss: 0.2500, LR: 0.000010 Epoch [83/250], Train Loss: 0.2862, Test Loss: 0.2581, LR: 0.000010 Epoch [84/250], Train Loss: 0.2837, Test Loss: 0.2594, LR: 0.000010 Epoch [85/250], Train Loss: 0.2829, Test Loss: 0.2560, LR: 0.000010 Epoch [86/250], Train Loss: 0.2840, Test Loss: 0.2586, LR: 0.000010 Epoch [87/250], Train Loss: 0.2820, Test Loss: 0.2607, LR: 0.000010 Epoch [88/250], Train Loss: 0.2824, Test Loss: 0.2600, LR: 0.000010 Epoch [89/250], Train Loss: 0.2824, Test Loss: 0.2706, LR: 0.000010 Epoch [90/250], Train Loss: 0.2821, Test Loss: 0.2593, LR: 0.000010 Epoch [91/250], Train Loss: 0.2806, Test Loss: 0.2530, LR: 0.000010 Epoch [92/250], Train Loss: 0.2818, Test Loss: 0.2550, LR: 0.000010 Epoch [93/250], Train Loss: 0.2808, Test Loss: 0.2449, LR: 0.000010 Epoch [94/250], Train Loss: 0.2805, Test Loss: 0.2529, LR: 0.000010 Epoch [95/250], Train Loss: 0.2798, Test Loss: 0.2535, LR: 0.000010 Epoch [96/250], Train Loss: 0.2801, Test Loss: 0.2488, LR: 0.000010 Epoch [97/250], Train Loss: 0.2796, Test Loss: 0.2491, LR: 0.000010 Epoch [98/250], Train Loss: 0.2786, Test Loss: 0.2641, LR: 0.000010 Epoch [99/250], Train Loss: 0.2793, Test Loss: 0.2476, LR: 0.000010 Epoch [100/250], Train Loss: 0.2769, Test Loss: 0.2426, LR: 0.000010 Epoch [101/250], Train Loss: 0.2787, Test Loss: 0.2454, LR: 0.000010 Epoch [102/250], Train Loss: 0.2772, Test Loss: 0.2469, LR: 0.000010 Epoch [103/250], Train Loss: 0.2759, Test Loss: 0.2448, LR: 0.000010 Epoch [104/250], Train Loss: 0.2766, Test Loss: 0.2518, LR: 0.000010 Epoch [105/250], Train Loss: 0.2753, Test Loss: 0.2400, LR: 0.000010 Epoch [106/250], Train Loss: 0.2742, Test Loss: 0.2442, LR: 0.000010 Epoch [107/250], Train Loss: 0.2754, Test Loss: 0.2443, LR: 0.000010 Epoch [108/250], Train Loss: 0.2749, Test Loss: 0.2521, LR: 0.000010 Epoch [109/250], Train Loss: 0.2749, Test Loss: 0.2528, LR: 0.000010 Epoch [110/250], Train Loss: 0.2729, Test Loss: 0.2514, LR: 0.000010 Epoch [111/250], Train Loss: 0.2734, Test Loss: 0.2453, LR: 0.000010 Epoch [112/250], Train Loss: 0.2728, Test Loss: 0.2383, LR: 0.000010 Epoch [113/250], Train Loss: 0.2719, Test Loss: 0.2477, LR: 0.000010 Epoch [114/250], Train Loss: 0.2721, Test Loss: 0.2453, LR: 0.000010 Epoch [115/250], Train Loss: 0.2714, Test Loss: 0.2527, LR: 0.000010 Epoch [116/250], Train Loss: 0.2715, Test Loss: 0.2389, LR: 0.000010 Epoch [117/250], Train Loss: 0.2716, Test Loss: 0.2426, LR: 0.000010 Epoch [118/250], Train Loss: 0.2725, Test Loss: 0.2531, LR: 0.000010 Epoch [119/250], Train Loss: 0.2707, Test Loss: 0.2356, LR: 0.000010 Epoch [120/250], Train Loss: 0.2728, Test Loss: 0.2451, LR: 0.000010 Epoch [121/250], Train Loss: 0.2714, Test Loss: 0.2481, LR: 0.000010 Epoch [122/250], Train Loss: 0.2702, Test Loss: 0.2386, LR: 0.000010 Epoch [123/250], Train Loss: 0.2699, Test Loss: 0.2429, LR: 0.000010 Epoch [124/250], Train Loss: 0.2693, Test Loss: 0.2478, LR: 0.000010 Epoch [125/250], Train Loss: 0.2703, Test Loss: 0.2472, LR: 0.000010 Epoch [126/250], Train Loss: 0.2687, Test Loss: 0.2426, LR: 0.000010 Epoch [127/250], Train Loss: 0.2693, Test Loss: 0.2362, LR: 0.000010 Epoch [128/250], Train Loss: 0.2684, Test Loss: 0.2529, LR: 0.000010 Epoch [129/250], Train Loss: 0.2670, Test Loss: 0.2463, LR: 0.000010 Epoch [130/250], Train Loss: 0.2686, Test Loss: 0.2364, LR: 0.000010 Epoch [131/250], Train Loss: 0.2673, Test Loss: 0.2378, LR: 0.000010 Epoch [132/250], Train Loss: 0.2670, Test Loss: 0.2402, LR: 0.000010 Epoch [133/250], Train Loss: 0.2695, Test Loss: 0.2476, LR: 0.000010 Epoch [134/250], Train Loss: 0.2663, Test Loss: 0.2379, LR: 0.000010 Epoch [135/250], Train Loss: 0.2690, Test Loss: 0.2382, LR: 0.000010 Epoch [136/250], Train Loss: 0.2681, Test Loss: 0.2428, LR: 0.000010 Epoch [137/250], Train Loss: 0.2661, Test Loss: 0.2472, LR: 0.000010 Epoch [138/250], Train Loss: 0.2674, Test Loss: 0.2410, LR: 0.000010 Epoch [139/250], Train Loss: 0.2660, Test Loss: 0.2391, LR: 0.000010 Epoch [140/250], Train Loss: 0.2667, Test Loss: 0.2304, LR: 0.000010 Epoch [141/250], Train Loss: 0.2661, Test Loss: 0.2547, LR: 0.000010 Epoch [142/250], Train Loss: 0.2649, Test Loss: 0.2442, LR: 0.000010 Epoch [143/250], Train Loss: 0.2668, Test Loss: 0.2369, LR: 0.000010 Epoch [144/250], Train Loss: 0.2653, Test Loss: 0.2371, LR: 0.000010 Epoch [145/250], Train Loss: 0.2643, Test Loss: 0.2376, LR: 0.000010 Epoch [146/250], Train Loss: 0.2638, Test Loss: 0.2384, LR: 0.000010 Epoch [147/250], Train Loss: 0.2651, Test Loss: 0.2485, LR: 0.000010 Epoch [148/250], Train Loss: 0.2632, Test Loss: 0.2292, LR: 0.000010 Epoch [149/250], Train Loss: 0.2640, Test Loss: 0.2418, LR: 0.000010 Epoch [150/250], Train Loss: 0.2642, Test Loss: 0.2519, LR: 0.000010 Epoch [151/250], Train Loss: 0.2628, Test Loss: 0.2302, LR: 0.000010 Epoch [152/250], Train Loss: 0.2632, Test Loss: 0.2518, LR: 0.000010 Epoch [153/250], Train Loss: 0.2627, Test Loss: 0.2313, LR: 0.000010 Epoch [154/250], Train Loss: 0.2629, Test Loss: 0.2322, LR: 0.000010 Epoch [155/250], Train Loss: 0.2619, Test Loss: 0.2387, LR: 0.000010 Epoch [156/250], Train Loss: 0.2630, Test Loss: 0.2315, LR: 0.000010 Epoch [157/250], Train Loss: 0.2616, Test Loss: 0.2288, LR: 0.000010 Epoch [158/250], Train Loss: 0.2642, Test Loss: 0.2340, LR: 0.000010 Epoch [159/250], Train Loss: 0.2623, Test Loss: 0.2284, LR: 0.000010 Epoch [160/250], Train Loss: 0.2623, Test Loss: 0.2406, LR: 0.000010 Epoch [161/250], Train Loss: 0.2615, Test Loss: 0.2406, LR: 0.000010 Epoch [162/250], Train Loss: 0.2636, Test Loss: 0.2358, LR: 0.000010 Epoch [163/250], Train Loss: 0.2612, Test Loss: 0.2242, LR: 0.000010 Epoch [164/250], Train Loss: 0.2615, Test Loss: 0.2364, LR: 0.000010 Epoch [165/250], Train Loss: 0.2615, Test Loss: 0.2336, LR: 0.000010 Epoch [166/250], Train Loss: 0.2592, Test Loss: 0.2331, LR: 0.000010 Epoch [167/250], Train Loss: 0.2598, Test Loss: 0.2309, LR: 0.000010 Epoch [168/250], Train Loss: 0.2602, Test Loss: 0.2482, LR: 0.000010 Epoch [169/250], Train Loss: 0.2605, Test Loss: 0.2315, LR: 0.000010 Epoch [170/250], Train Loss: 0.2611, Test Loss: 0.2299, LR: 0.000010 Epoch [171/250], Train Loss: 0.2597, Test Loss: 0.2387, LR: 0.000010 Epoch [172/250], Train Loss: 0.2587, Test Loss: 0.2381, LR: 0.000010 Epoch [173/250], Train Loss: 0.2594, Test Loss: 0.2341, LR: 0.000010 Epoch [174/250], Train Loss: 0.2600, Test Loss: 0.2283, LR: 0.000010 Epoch [175/250], Train Loss: 0.2597, Test Loss: 0.2394, LR: 0.000010 Epoch [176/250], Train Loss: 0.2590, Test Loss: 0.2284, LR: 0.000010 Epoch [177/250], Train Loss: 0.2608, Test Loss: 0.2243, LR: 0.000010 Epoch [178/250], Train Loss: 0.2590, Test Loss: 0.2357, LR: 0.000010 Epoch [179/250], Train Loss: 0.2588, Test Loss: 0.2364, LR: 0.000010 Epoch [180/250], Train Loss: 0.2588, Test Loss: 0.2422, LR: 0.000010 Epoch [181/250], Train Loss: 0.2599, Test Loss: 0.2389, LR: 0.000010 Epoch [182/250], Train Loss: 0.2588, Test Loss: 0.2221, LR: 0.000010 Epoch [183/250], Train Loss: 0.2573, Test Loss: 0.2378, LR: 0.000010 Epoch [184/250], Train Loss: 0.2570, Test Loss: 0.2338, LR: 0.000010 Epoch [185/250], Train Loss: 0.2576, Test Loss: 0.2366, LR: 0.000010 Epoch [186/250], Train Loss: 0.2585, Test Loss: 0.2385, LR: 0.000010 Epoch [187/250], Train Loss: 0.2563, Test Loss: 0.2342, LR: 0.000010 Epoch [188/250], Train Loss: 0.2573, Test Loss: 0.2271, LR: 0.000010 Epoch [189/250], Train Loss: 0.2560, Test Loss: 0.2330, LR: 0.000010 Epoch [190/250], Train Loss: 0.2572, Test Loss: 0.2269, LR: 0.000010 Epoch [191/250], Train Loss: 0.2571, Test Loss: 0.2352, LR: 0.000010 Epoch [192/250], Train Loss: 0.2574, Test Loss: 0.2471, LR: 0.000010 Epoch [193/250], Train Loss: 0.2569, Test Loss: 0.2271, LR: 0.000010 Epoch [194/250], Train Loss: 0.2581, Test Loss: 0.2260, LR: 0.000010 Epoch [195/250], Train Loss: 0.2545, Test Loss: 0.2294, LR: 0.000010 Epoch [196/250], Train Loss: 0.2558, Test Loss: 0.2320, LR: 0.000010 Epoch [197/250], Train Loss: 0.2566, Test Loss: 0.2283, LR: 0.000010 Epoch [198/250], Train Loss: 0.2562, Test Loss: 0.2263, LR: 0.000010 Epoch [199/250], Train Loss: 0.2561, Test Loss: 0.2318, LR: 0.000010 Epoch [200/250], Train Loss: 0.2563, Test Loss: 0.2382, LR: 0.000010 Epoch [201/250], Train Loss: 0.2563, Test Loss: 0.2424, LR: 0.000010 Epoch [202/250], Train Loss: 0.2554, Test Loss: 0.2291, LR: 0.000010 Epoch [203/250], Train Loss: 0.2562, Test Loss: 0.2325, LR: 0.000005 Epoch [204/250], Train Loss: 0.2555, Test Loss: 0.2365, LR: 0.000005 Epoch [205/250], Train Loss: 0.2554, Test Loss: 0.2318, LR: 0.000005 Epoch [206/250], Train Loss: 0.2560, Test Loss: 0.2231, LR: 0.000005 Epoch [207/250], Train Loss: 0.2538, Test Loss: 0.2339, LR: 0.000005 Epoch [208/250], Train Loss: 0.2570, Test Loss: 0.2361, LR: 0.000005 Epoch [209/250], Train Loss: 0.2547, Test Loss: 0.2378, LR: 0.000005 Epoch [210/250], Train Loss: 0.2542, Test Loss: 0.2318, LR: 0.000005 Epoch [211/250], Train Loss: 0.2547, Test Loss: 0.2338, LR: 0.000005 Epoch [212/250], Train Loss: 0.2541, Test Loss: 0.2402, LR: 0.000005 Epoch [213/250], Train Loss: 0.2552, Test Loss: 0.2254, LR: 0.000005 Epoch [214/250], Train Loss: 0.2549, Test Loss: 0.2297, LR: 0.000005 Epoch [215/250], Train Loss: 0.2532, Test Loss: 0.2330, LR: 0.000005 Epoch [216/250], Train Loss: 0.2540, Test Loss: 0.2274, LR: 0.000005 Epoch [217/250], Train Loss: 0.2542, Test Loss: 0.2317, LR: 0.000005 Epoch [218/250], Train Loss: 0.2534, Test Loss: 0.2246, LR: 0.000005 Epoch [219/250], Train Loss: 0.2534, Test Loss: 0.2256, LR: 0.000005 Epoch [220/250], Train Loss: 0.2533, Test Loss: 0.2386, LR: 0.000005 Epoch [221/250], Train Loss: 0.2544, Test Loss: 0.2220, LR: 0.000005 Epoch [222/250], Train Loss: 0.2538, Test Loss: 0.2375, LR: 0.000005 Early stopping en época 222 Entrenamiento de la Red Neuronal completado. Tiempo de entrenamiento de la Red Neuronal: 159.63 segundos Mejor pérdida de validación: 0.2221
# Guardamos el modelo para su reproducibilidad
# processor (fitted on X_train)
preprocessor_filename_nn = models_path + '/nn_preprocessor.joblib'
joblib.dump(preprocessor, preprocessor_filename_nn)
print(f"Preprocesador de NN guardado en: {preprocessor_filename_nn}")
# y_scaler (fitted on y_train)
y_scaler_filename_nn = models_path + '/nn_y_scaler.joblib'
joblib.dump(y_scaler, y_scaler_filename_nn)
print(f"Escalador de targets de NN guardado en: {y_scaler_filename_nn}")
# Parámetros aprendidos por el modelo de Red Neuronal.
model_state_filename_nn = models_path + '/nn_model_state.pth'
torch.save(model.state_dict(), model_state_filename_nn)
print(f"Estado del modelo NN guardado en: {model_state_filename_nn}")Preprocesador de NN guardado en: ../Modelo_Entrenado//nn_preprocessor.joblib Escalador de targets de NN guardado en: ../Modelo_Entrenado//nn_y_scaler.joblib Estado del modelo NN guardado en: ../Modelo_Entrenado//nn_model_state.pth
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss', color=naturgy_orange_palette[0], linewidth=2)
plt.plot(test_losses, label='Test/Validation Loss', color=naturgy_blue_palette[0], linewidth=2)
plt.title('Curvas de Aprendizaje de la Red Neuronal', fontsize=14, color=naturgy_blue_palette[0], fontweight='bold')
plt.xlabel('Epochs', fontsize=12, color=naturgy_blue_palette[1])
plt.ylabel('Loss (MSE)', fontsize=12, color=naturgy_blue_palette[1])
plt.grid(True, alpha=0.3, linestyle='--')
plt.legend(fontsize=10)
best_epoch = test_losses.index(min(test_losses))
plt.axvline(x=best_epoch, color=naturgy_blue_palette[2], linestyle='--',
label=f'Mejor modelo (época {best_epoch+1})')
plt.tight_layout()
plt.show()
La gráfica se ve bastante bien, tanto training como test bajan de forma rápida indicando que el modelo está generalizado bien.
Probé con otro tipo de redes neuronales más específicos para series temporales, pero o por falta de lags, rollings etc o porque no preparo bien los datos los resultados no mejoran, eso sí TSLM da gráficas de loss vs epoch bastante estables.
Funciones útiles para cargar y visualizar modelos
def evaluate_model(y_true, y_pred, model_name):
results = {
target: {
'R2': r2_score(y_true[target], y_pred[:, i]),
'RMSE': np.sqrt(mean_squared_error(y_true[target], y_pred[:, i])),
'MAE': mean_absolute_error(y_true[target], y_pred[:, i]),
'sMAPE': np.mean(np.abs((y_true[target] - y_pred[:, i]) / (y_true[target] + 1e-9))) * 100
}
for i, target in enumerate(y_true.columns)
}
df_results = pd.DataFrame(results).T
print(f"--- Resultados para {model_name} ---")
display(df_results)
return df_resultsdef predict_with_neural_network(nn_model, X_data, nn_X_preprocessor, nn_y_scaler):
"""
Realiza predicciones usando un modelo de red neuronal PyTorch,
incluyendo preprocesamiento de X y escalado inverso de y.
"""
if nn_model is None or nn_X_preprocessor is None or nn_y_scaler is None:
print("Error: El modelo NN o sus componentes no están disponibles para la predicción.")
return None
# 1. Preprocesar X_data con el preprocesador de la NN
X_data_scaled = nn_X_preprocessor.transform(X_data)
# 2. Realizar predicciones con el modelo NN
nn_model.eval()
with torch.no_grad():
y_pred_scaled = nn_model(torch.tensor(X_data_scaled, dtype=torch.float32)).numpy()
# 3. Revertir el escalado de las predicciones
y_pred_original_scale = nn_y_scaler.inverse_transform(y_pred_scaled)
return y_pred_original_scaledef generate_comparison_plots(y_true_df, y_pred_xgb_data, y_pred_nn_data, dataset_name):
"""
Genera y muestra gráficos de comparación para valores reales vs. predichos.
"""
y_pred_xgb_df = None
if y_pred_xgb_data is not None:
y_pred_xgb_df = pd.DataFrame(y_pred_xgb_data, columns=y_true_df.columns, index=y_true_df.index)
y_pred_nn_df = None
if y_pred_nn_data is not None:
y_pred_nn_df = pd.DataFrame(y_pred_nn_data, columns=y_true_df.columns, index=y_true_df.index)
num_targets = len(y_true_df.columns)
fig_height = max(15, num_targets * 5)
fig, axes = plt.subplots(num_targets, 1, figsize=(18, fig_height), sharex=True)
if num_targets == 1:
axes = [axes]
for i, target_col in enumerate(y_true_df.columns):
ax = axes[i]
y_true_df[target_col].plot(ax=ax, label=f'Valor Real ({dataset_name})', color='black', alpha=0.7)
if y_pred_xgb_df is not None:
y_pred_xgb_df[target_col].plot(ax=ax, label=f'Pred. XGBoost ({dataset_name})',
color=naturgy_blue_palette[0], linestyle='--')
if y_pred_nn_df is not None:
y_pred_nn_df[target_col].plot(ax=ax, label=f'Pred. PyTorch ({dataset_name})',
color=naturgy_orange_palette[0], linestyle=':')
ax.set_title(f'Comparación de Predicciones para {target_col} ({dataset_name})',
fontsize=12, color=naturgy_blue_palette[0], fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, linestyle='--')
ax.set_ylabel('Valor', fontsize=10, color=naturgy_blue_palette[1])
plt.tight_layout()
plt.show()def load_xgb_pipeline_model(model_path):
"""Carga un pipeline de XGBoost guardado con joblib."""
try:
pipeline = joblib.load(model_path)
print(f"Pipeline XGBoost cargado desde: {model_path}")
return pipeline
except FileNotFoundError:
print(f"ERROR: No se encontró el archivo del pipeline XGBoost: {model_path}")
return NoneResultados (en Conjunto de Validación)
y_pred_xgb_val = xgb_pipeline.predict(X_val)
y_pred_nn_val = predict_with_neural_network(model, X_val, preprocessor, y_scaler)
# --- Evaluación de Modelos en el CONJUNTO DE VALIDACIÓN ---
if y_pred_xgb_val is not None:
results_xgb_val = evaluate_model(y_val, y_pred_xgb_val, 'XGBoost (Validación)')
if y_pred_nn_val is not None:
results_nn_val = evaluate_model(y_val, y_pred_nn_val, 'Red Neuronal (PyTorch) (Validación)')
print("\n--- Comparativa de Tiempos de Entrenamiento ---")
training_times_data = {
'Modelo': ['XGBoost', 'Red Neuronal (PyTorch)'],
'Tiempo de Entrenamiento (s)': [training_time_xgb, training_time_nn]
}
df_training_times = pd.DataFrame(training_times_data)
display(df_training_times)
print("\n--- Gráficos de Comparación en el Conjunto de Validación ---")
generate_comparison_plots(y_val, y_pred_xgb_val, y_pred_nn_val, "Validación")--- Resultados para XGBoost (Validación) ---
| R2 | RMSE | MAE | sMAPE | |
|---|---|---|---|---|
| renewable_H_plus_6 | 0.796624 | 1867.445213 | 1362.814661 | 12.291641 |
| load_H_plus_6 | 0.897553 | 1440.722241 | 1004.700937 | 3.495286 |
| renewable_H_plus_1 | 0.969492 | 723.183029 | 434.534922 | 3.960992 |
| load_H_plus_1 | 0.986647 | 519.790706 | 313.268580 | 1.093828 |
| price_H_plus_1 | 0.960352 | 2.459631 | 1.810902 | 3.101360 |
--- Resultados para Red Neuronal (PyTorch) (Validación) ---
| R2 | RMSE | MAE | sMAPE | |
|---|---|---|---|---|
| renewable_H_plus_6 | 0.798433 | 1859.120601 | 1357.319815 | 12.387131 |
| load_H_plus_6 | 0.810376 | 1960.096825 | 1481.159004 | 5.060964 |
| renewable_H_plus_1 | 0.956510 | 863.453177 | 590.560515 | 5.389653 |
| load_H_plus_1 | 0.962651 | 869.304349 | 648.637170 | 2.283899 |
| price_H_plus_1 | 0.939531 | 3.037564 | 2.335380 | 4.037698 |
--- Comparativa de Tiempos de Entrenamiento ---
| Modelo | Tiempo de Entrenamiento (s) | |
|---|---|---|
| 0 | XGBoost | 177.591160 |
| 1 | Red Neuronal (PyTorch) | 159.627548 |
--- Gráficos de Comparación en el Conjunto de Validación ---

Conclusiones de los modelos
- Los resultados para el Horizonte de predicciones de H+1 son más precisos que los calculados para H+6.
- Por lo general XGBoost tiene mejor rendimiento en casi todos los apartados R2 superior etc.
- La variable más difícil de calcular para ambos modelos es la Generación de Renovables + 6H posiblemente porque depende mucho de factores difíciles de calcular como puede ser la generación eólica y solar.
- El tiempo de entrenamiento es similar en ambos modelos.
En general el modelo más robusto con los datos de validación es XGBoost
8 Prueba Final
Carga de Modelos y Preprocesadores
def load_neural_network_components(preprocessor_path, y_scaler_path, model_state_path,
ModelClass, input_size, output_size):
"""
Carga un modelo de Red Neuronal PyTorch y sus componentes asociados.
ModelClass es la clase del modelo (ej. MLP).
input_size y output_size son las dimensiones necesarias para instanciar ModelClass.
"""
loaded_preprocessor = joblib.load(preprocessor_path)
print(f"Preprocesador de NN cargado desde: {preprocessor_path}")
loaded_y_scaler = joblib.load(y_scaler_path)
print(f"Escalador de targets de NN cargado desde: {y_scaler_path}")
loaded_model = ModelClass(input_size, output_size)
loaded_model.load_state_dict(torch.load(model_state_path))
loaded_model.eval()
print(f"Estado del modelo NN cargado desde: {model_state_path}")
return loaded_model, loaded_preprocessor, loaded_y_scalerprint("\n--- Cargando Modelos y Preprocesadores Guardados ---")
model_filename_xgb = models_path + '/xgb_model-energy_weather.joblib'
loaded_xgb_pipeline = load_xgb_pipeline_model(model_filename_xgb)
preprocessor_filename_nn = models_path + '/nn_preprocessor.joblib'
y_scaler_filename_nn = models_path + '/nn_y_scaler.joblib'
model_state_filename_nn = models_path + '/nn_model_state.pth'--- Cargando Modelos y Preprocesadores Guardados --- Pipeline XGBoost cargado desde: ../Modelo_Entrenado//xgb_model-energy_weather.joblib
try:
numeric_features_list = X_train.select_dtypes(include=np.number).columns.tolist()
actual_input_size_nn = len(numeric_features_list)
actual_output_size_nn = y_train.shape[1]
print(f"Determinadas dimensiones para MLP: input={actual_input_size_nn}, output={actual_output_size_nn}")
except NameError:
print("ADVERTENCIA: X_train o y_train no están definidos. Debes establecer actual_input_size_nn y actual_output_size_nn manualmente.")
actual_input_size_nn = 0
actual_output_size_nn = 0
loaded_nn_model, loaded_nn_preprocessor, loaded_nn_y_scaler = load_neural_network_components(
preprocessor_filename_nn,
y_scaler_filename_nn,
model_state_filename_nn,
MLP,
actual_input_size_nn,
actual_output_size_nn
)Determinadas dimensiones para MLP: input=181, output=5 Preprocesador de NN cargado desde: ../Modelo_Entrenado//nn_preprocessor.joblib Escalador de targets de NN cargado desde: ../Modelo_Entrenado//nn_y_scaler.joblib Estado del modelo NN cargado desde: ../Modelo_Entrenado//nn_model_state.pth
Predicciones y Evaluación en el Conjunto de Prueba Final
print("\n--- Realizando predicciones en el CONJUNTO DE PRUEBA FINAL ---")
y_pred_xgb_final = loaded_xgb_pipeline.predict(X_final_test)
y_pred_nn_final = predict_with_neural_network(
loaded_nn_model, X_final_test, loaded_nn_preprocessor, loaded_nn_y_scaler
)
results_xgb_final = evaluate_model(y_final_test, y_pred_xgb_final, 'XGBoost - Prueba Final')
results_nn_final = evaluate_model(y_final_test, y_pred_nn_final, 'Red Neuronal - Prueba Final')
print("\n--- Gráficos de Comparación en el Conjunto de Prueba Final ---")
generate_comparison_plots(y_final_test, y_pred_xgb_final, y_pred_nn_final, "Prueba Final")
print("\n--- Prueba Final Completada ---")--- Realizando predicciones en el CONJUNTO DE PRUEBA FINAL --- --- Resultados para XGBoost - Prueba Final ---
| R2 | RMSE | MAE | sMAPE | |
|---|---|---|---|---|
| renewable_H_plus_6 | 0.823469 | 1484.932725 | 1186.277652 | 11.958129 |
| load_H_plus_6 | 0.905611 | 1511.633135 | 1121.651317 | 4.124332 |
| renewable_H_plus_1 | 0.974325 | 565.720860 | 402.244674 | 4.114768 |
| load_H_plus_1 | 0.988410 | 533.791674 | 336.940117 | 1.280638 |
| price_H_plus_1 | 0.933953 | 1.836295 | 1.400207 | 2.104182 |
--- Resultados para Red Neuronal - Prueba Final ---
| R2 | RMSE | MAE | sMAPE | |
|---|---|---|---|---|
| renewable_H_plus_6 | 0.847161 | 1381.698408 | 1089.494666 | 11.027053 |
| load_H_plus_6 | 0.846812 | 1925.743498 | 1428.098706 | 5.109202 |
| renewable_H_plus_1 | 0.964282 | 667.256410 | 499.152064 | 5.006259 |
| load_H_plus_1 | 0.974265 | 795.417798 | 610.786716 | 2.239124 |
| price_H_plus_1 | 0.897759 | 2.284692 | 1.781431 | 2.662419 |
--- Gráficos de Comparación en el Conjunto de Prueba Final ---

--- Prueba Final Completada ---
Conclusiones Finales del Rendimiento de los Modelos
XGBoost definitivamente es el modelo más completo para nuestros objetivos. Probablemente su mejor rendimiento se debe a que a diferencia de la red neuronal no necesita una ingeniería tan refinada como la que podría necesitar una rede neuronal. A parte XGBoost es buena trantando datos tabulares estructurados.
La red neuronal destaca por ser más preciso en las renovables + 6h probablemente a que encuentra patrones que XGBoost no puede ya que es la variable más dificl de calcular.
Aunque el modelo tiene unos resultados que se podrían llevar a producción, a lo largo del tiempo estos modelos se podría refinar a medida que se van alimentando con más datos.