HDF5 output structure
=====================

.. Note::
    The output of a NuRadioMC simulation is saved in the HDF5 file format, as well as (optionally) in ``.nur`` files.
    The HDF5 files contain mostly high-level output parameters in a standard, table-like structure.
    They do not include the simulated voltage traces. For more advanced analyses (e.g. reconstruction),
    you probably need to use the ``.nur`` files.
    The data structure of ``.nur`` files is explained :doc:`here </NuRadioReco/pages/event_structure>`.

.. Note::
    This page outlines the structure of the HDF5 files v3.0. Find the structure of :doc:`v2.2 here </NuRadioMC/pages/HDF5_structures_history/HDF5_v2.2>`.


Opening the HDF5 file
---------------------
The HDF5 file can be opened using the ``h5py`` module:

.. code-block:: Python

    import h5py

    f = h5py.File("/path/to/hdf5_file", mode='r')
    attributes = f.attrs

    # (...) read attributes and datasets here while the file is still open
    f.close()

If you have many HDF5 files, for example because you ran a simulation parallelized over multiple energy bins,
NuRadioMC contains a convenience function to correctly merge these files -
see :ref:`here <NuRadioMC/pages/Manuals/running_on_a_cluster:4. Merge individual hdf5 output files>` for instructions.

What's behind the HDF5 files
----------------------------
The HDF5 file is created in ``NuRadioMC/simulation/simulation.py``. The event generator provides a list of vertices
with different arrival directions (zenith and azimuth) and energies. Starting from the vertex, several sub-showers are
created along the track. The sub-showers themselves are not simulated, but the electric field of each sub-shower is
provided. Sub-showers that occur within a certain time interval arrive at the antenna simultaneously and interfere
constructively; their signals are therefore summed.

The ``event_group_id`` is the same for all showers that follow the same first interaction.
The ``shower_id`` is unique for every shower. Showers that interfere constructively are combined into one event and
share the same ``event_id``, which starts from 0.

  .. image:: event_sketch.png
    :width: 70%
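
The relation between the identifiers can be explored with NumPy. The following is a minimal sketch, not part of NuRadioMC:
the function name ``showers_per_event_group`` is hypothetical, and the toy lists stand in for the ``event_group_ids`` and
``shower_ids`` datasets described further down this page.

.. code-block:: Python

    import numpy as np

    def showers_per_event_group(event_group_ids, shower_ids):
        """Map each event group id to the list of shower ids that belong to it."""
        event_group_ids = np.asarray(event_group_ids)
        shower_ids = np.asarray(shower_ids)
        return {
            int(gid): shower_ids[event_group_ids == gid].tolist()
            for gid in np.unique(event_group_ids)
        }

    # Toy values; with an open file use f["event_group_ids"][:] and f["shower_ids"][:]
    groups = showers_per_event_group([0, 0, 0, 1, 2, 2], [0, 1, 2, 3, 4, 5])
    n_events = len(groups)  # same as len(np.unique(event_group_ids))

Each dictionary entry then lists the showers that follow the same first interaction.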

HDF5 structure
--------------
The HDF5 files can be thought of as a structured dictionary:

- The top level :ref:`attributes <NuRadioMC/pages/HDF5_structure:HDF5 file attributes>`, which can be accessed through ``f.attrs``, contain some top-level information about the simulation.
- The :ref:`individual keys <NuRadioMC/pages/HDF5_structure:HDF5 file contents>` contain some properties (energy, vertex, ...) for each stored event or shower.
- Finally, the ``station_<station_id>`` key contains slightly more detailed information (triggers, propagation times, amplitudes...) at the level of individual channels :ref:`for each station <NuRadioMC/pages/HDF5_structure:Station data>`. Each station group has its own attributes (``f["station_<station_id>"].attrs``).
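
As a small illustration of this layout, the sketch below separates the station groups from the other top-level keys.
The helper ``split_keys`` is hypothetical, the key list is a toy example, and the code assumes that station ids are
integers.

.. code-block:: Python

    def split_keys(keys):
        """Separate per-station group names from the other top-level keys."""
        prefix = "station_"
        station_ids = sorted(int(k[len(prefix):]) for k in keys if k.startswith(prefix))
        dataset_keys = sorted(k for k in keys if not k.startswith(prefix))
        return station_ids, dataset_keys

    # Toy key list; with an open file pass list(f.keys()) instead
    station_ids, dataset_keys = split_keys(["energies", "station_101", "zeniths", "station_11"])

    # Names under which the station groups can be accessed, e.g. f[group_names[0]]
    group_names = [f"station_{sid}" for sid in station_ids]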

HDF5 file attributes
____________________

The top-level attributes can be accessed using ``f.attrs``. These contain:

    .. _hdf5-attrs-table:

    .. csv-table:: HDF5 attributes
        :header-rows: 1
        :widths: auto
        :delim: |

        Key | Description
        ``NuRadioMC_EvtGen_version`` ``NuRadioMC_EvtGen_version_hash`` ``NuRadioMC_version`` ``NuRadioMC_version_hash`` | Versions of the generator/framework, as integer and as hash
        ``Emin`` ``Emax`` | Define energy range for neutrino energies
        ``phimax`` ``phimin`` | Define azimuth range for incoming neutrino directions
        ``thetamax`` ``thetamin`` | Define zenith range for incoming neutrino directions
        ``flavors`` | A list of particle flavors that were simulated, using the PDG convention.
        ``n_events`` | Total number of generated/simulated events (including those that did not trigger)
        ``fiducial_xmax`` ``fiducial_xmin`` ``fiducial_ymax`` ``fiducial_ymin`` ``fiducial_zmax`` ``fiducial_zmin`` / ``fiducial_rmax`` ``fiducial_rmin`` ``fiducial_zmax`` ``fiducial_zmin`` | Specify the simulated cubic/cylindrical fiducial volume. An event has to produce an interaction within this volume. However, in the case of a muon or tau CC interaction, the first interaction can occur outside.
        ``rmax`` ``rmin`` ``zmax`` ``zmin`` / ``xmax`` ``xmin`` ``ymax`` ``ymin`` ``zmax`` ``zmin`` | Specify the cylindrical/cubic volume in which neutrino interactions are generated
        ``volume`` | Size of the volume specified above
        ``area`` | Surface area of the volume specified above
        ``start_event_id`` | ``event_id`` of the first event in the file
        ``trigger_names`` | List of the names of the different triggers simulated
        ``Tnoise`` | (explicit) noise temperature used in simulation
        ``n_samples`` | Number of samples of the generated antenna signals
        ``config`` | The (yaml-style) config file used for the simulation
        ``deposited`` |
        ``detector`` | The (json-format) detector description used for the simulation
        ``dt`` | The time resolution, i.e. the inverse of the sampling rate used for the simulation. This is not necessarily the same as the sampling rate of the simulated channels!
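
The relation between ``dt`` and the sampling rate can be written out explicitly. This is a minimal sketch with
hypothetical helper names and toy values. Treating ``n_samples * dt`` as the trace length is an assumption based on
the descriptions above, and the results are in whatever time unit ``dt`` is stored in.

.. code-block:: Python

    def sampling_rate(dt):
        """Return the simulation sampling rate as the inverse of the time resolution."""
        return 1.0 / dt

    def trace_duration(dt, n_samples):
        """Return the total length of a simulated trace (assumed to be n_samples * dt)."""
        return n_samples * dt

    # Toy values; with an open file use f.attrs["dt"] and f.attrs["n_samples"]
    dt, n_samples = 0.5, 2048
    rate = sampling_rate(dt)                   # in inverse units of dt
    duration = trace_duration(dt, n_samples)   # in the same unit as dt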


The station-level attributes can be accessed using ``f["station_<station_id>"].attrs``. The first two attributes, ``Vrms`` and ``bandwidth``, also exist at the top level, where they refer to the values of the first station/channel pair.

    .. _hdf5-station-attrs-table:

    .. csv-table:: HDF5 station attributes
        :header-rows: 1
        :widths: auto
        :delim: |

        Key | Description
        ``Vrms`` | RMS of the voltage used as thermal noise floor :math:`v_{n} = (k_{B} \, R \, T \, \Delta f) ^ {0.5}`. See the relevant section "Maximum transfer of noise power" in this `wiki article <https://en.wikipedia.org/wiki/Johnson%E2%80%93Nyquist_noise>`_ . Determined from ``Tnoise`` and ``bandwidth`` (see below).
        ``Vrms_trigger`` | (Optional) Same as ``Vrms`` but for the trigger channels if they were simulated with a different response.
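        ``bandwidth`` | Bandwidth :math:`\Delta f` in the above equation. Calculated as the integral over the squared magnitude of the simulated filter response ``filt``: :math:`\Delta f = \int \lvert \mathrm{filt}(f) \rvert^{2} \, \mathrm{d}f`, evaluated numerically as ``np.trapz(np.abs(filt) ** 2, ff)``.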
        ``antenna_positions`` | Relative position of all simulated antennas (channels)
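
The two formulas for ``Vrms`` and ``bandwidth`` can be checked numerically. The sketch below is illustrative only and
is not the NuRadioMC implementation: it works in SI units, assumes a resistance of 50 Ohm, and uses a toy
rectangular filter. The function names are hypothetical.

.. code-block:: Python

    import numpy as np

    K_B = 1.380649e-23  # Boltzmann constant in J/K

    def bandwidth(filt, ff):
        """Integrate the squared filter magnitude over frequency (trapezoidal rule)."""
        # np.trapz was renamed to np.trapezoid in NumPy 2.0
        integrate = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
        return float(integrate(np.abs(filt) ** 2, ff))

    def vrms(t_noise, delta_f, resistance=50.0):
        """Thermal noise RMS voltage v_n = sqrt(k_B * R * T * delta_f) in volts."""
        return float(np.sqrt(K_B * resistance * t_noise * delta_f))

    # Toy filter: unit gain between 100 MHz and 600 MHz, zero elsewhere
    ff = np.linspace(0.0, 1.0e9, 1001)  # frequencies in Hz
    filt = ((ff >= 100e6) & (ff <= 600e6)).astype(float)
    delta_f = bandwidth(filt, ff)       # close to 500 MHz
    v_n = vrms(300.0, delta_f)          # roughly 10 microvolts for T = 300 K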

HDF5 file contents
__________________
The HDF5 file contains the following items. Listed are the ``key`` and the ``shape`` of each HDF5 dataset, where ``n_events`` is the number of events stored in the file and ``n_showers``
is the number of showers (which may be larger than the number of events), and ``n_triggers`` is the number of different triggers simulated. Each "row" corresponds to a particle shower which can produce radio emission.

    .. _hdf5-items-table:

    .. csv-table:: HDF5 items
        :header-rows: 1
        :widths: auto
        :delim: |

        Key | Shape | Description
        ``event_group_ids`` | (``n_showers``) | Specifies the event group id to which the corresponding shower belongs (``n_events = len(unique(event_group_ids))``)
        ``xx`` ``yy`` ``zz`` | (``n_showers``) | Coordinates of the interaction vertices
        ``vertex_times`` | (``n_showers``) | Time at the interaction vertex. The neutrino interaction (= first interaction) is defined as time 0
        ``azimuths`` ``zeniths`` | (``n_showers``) | Angles specifying the neutrino incoming direction (``azimuths = 0`` points east)
        ``energies`` | (``n_showers``) | Energy of the parent particle of a shower. This is typically the energy of the neutrino (for showers produced at the first interaction: all flavor NC, electron CC interactions) or the energy of a muon or tau lepton when those are producing secondary energy losses
        ``shower_energies`` | (``n_showers``) | Energy of the shower which is used to determine the radio emission
        ``flavors`` | (``n_showers``) | Flavor of the parent particle of a shower, defined in the same way as for ``energies`` (the parent of an electromagnetic cascade in an electron CC interaction is the neutrino)
        ``inelasticity`` | (``n_showers``) | Inelasticity of the first interaction
        ``interaction_type`` | (``n_showers``) | Interaction type producing the shower (for the first interaction that can be "nc" or "cc")
        ``multiple_triggers`` | (``n_showers``, ``n_triggers``) | Boolean array specifying which triggers fired for each shower. The trigger names are listed in the attributes (``f.attrs["trigger_names"]``), in the same order as the columns of ``multiple_triggers``
        ``triggered`` | (``n_showers``) | A boolean; ``True`` if any trigger fired for this shower, ``False`` otherwise
        ``trigger_times`` | (``n_showers``, ``n_triggers``) | The trigger times (relative to the first interaction) at which each shower triggered. If there are multiple stations, this will be the earliest trigger time.
        ``n_interaction`` | (``n_showers``) | Hierarchical counter for the number of showers per event (also accounts for showers which did not trigger and might not be saved)
        ``shower_ids`` | (``n_showers``) | Hierarchical counter for the number of triggered showers
        ``shower_realization_ARZ`` | (``n_showers``) | Which realization from the ARZ shower library was used for each shower (only if ARZ was used for signal generation).
        ``shower_type`` | (``n_showers``) | Type of the shower (so far we only have "em" and "had")
        ``weights`` | (``n_showers``) | Weight for the probability that the neutrino reached the interaction vertex, taking into account the attenuation in the Earth (does not include the interaction probability in the volume)


Station data
____________
In addition, the HDF5 file contains a key for each station in the simulation.
Each station key contains more detailed information for that station. Some parameters are stored per event and
some per shower. See https://doi.org/10.22323/1.395.1231 for a description of how showers relate to events.
``m_events`` and ``m_showers`` refer to the number of events and showers that triggered the station. NOTE: The simple table
structure of HDF5 files cannot capture the complex relation between events and showers in all cases. Some fields can be ambiguous
(e.g. ``trigger_times``, which only lists the last trigger that a shower generated).
For more advanced analyses, please use the ``*.nur`` files.
The ``event_group_id`` is the same as in the top-level datasets. For a given ``event_group_id``, you can therefore
check which stations contain it and retrieve which station triggered, with which amplitude, etc.
The same approach works for ``shower_id``.
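
The matching by id described above can be sketched as follows. This is an illustrative example with toy values and
a hypothetical helper name. It assumes that shower ids are unique in the top-level ``shower_ids`` dataset, as stated
earlier on this page.

.. code-block:: Python

    import numpy as np

    def lookup_by_shower_id(global_shower_ids, global_values, station_shower_ids):
        """Return the top-level value for every shower id found in a station group."""
        row_of = {int(sid): row for row, sid in enumerate(global_shower_ids)}
        rows = [row_of[int(sid)] for sid in station_shower_ids]  # KeyError if an id is missing
        return np.asarray(global_values)[rows]

    # Toy values; with an open file use f["shower_ids"][:], f["shower_energies"][:]
    # and the "shower_id" dataset of the station group of interest
    global_shower_ids = [0, 1, 2, 3, 4]
    shower_energies = [1e17, 3e17, 2e18, 5e16, 7e17]
    station_shower_ids = [1, 4]
    energies_in_station = lookup_by_shower_id(global_shower_ids, shower_energies, station_shower_ids)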

    .. _hdf5-station-table:

    .. csv-table:: HDF5 station items
        :header-rows: 1
        :widths: auto
        :delim: |

        Key | Shape | Description
        ``event_group_ids`` | (``m_events``) | The event group ids of the triggered events in the selected station
        ``event_group_id_per_shower`` | (``m_showers``) | The event group id of every shower that triggered the selected station
        ``event_ids`` | (``m_events``) | The event ids of each event that triggered in that station for every event group id. These are unique only within each separate event group, and start from 0.
        ``event_id_per_shower`` | (``m_showers``) | The event ids of each event that triggered in that station. This one is for every shower
        ``shower_id`` | (``m_showers``) | The shower ids of the showers that triggered the selected station
        ``max_amp_shower_and_ray`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | Maximum amplitude per shower, channel and ray tracing solution.
        ``maximum_amplitudes`` | (``m_events``, ``n_channels``) | Maximum amplitude per event and channel
        ``maximum_amplitudes_envelope`` | (``m_events``, ``n_channels``) | Maximum amplitude of the Hilbert envelope for each event and channel
        ``multiple_triggers`` | (``m_showers``, ``n_triggers``) | A boolean array that specifies if a shower contributed to an event that fulfills a certain trigger. The index of the trigger can be translated to the trigger name via the attribute ``trigger_names``.
        ``multiple_triggers_per_event`` | (``m_events``, ``n_triggers``) | A boolean array that specifies if each event fulfilled a certain trigger. The index of the trigger can be translated to the trigger name via the attribute ``trigger_names``.
        ``polarization`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``, ``3``) | 3D coordinates of the polarization vector at the antenna in cartesian coordinates. (The receive vector (which is opposite to the propagation direction) was used to rotate from spherical/on-sky coordinates to cartesian coordinates). The polarization vector does not include any propagation effects that could change the polarization, such as different reflectivities at the surface for the p and s polarization component.
        ``ray_tracing_C0`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | One of two parameters specifying the **analytic** ray tracing solution. Can be used to retrieve the solutions without having to re-run the ray tracer.
        ``ray_tracing_C1`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | One of two parameters specifying the **analytic** ray tracing solution. Can be used to retrieve the solutions without having to re-run the ray tracer.
        ``ray_tracing_reflection`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | The number of bottom reflections (This variable is only non-zero if a reflection layer was defined in the ice model and if 'propagation.n_reflections' was set to a value larger than 0 in the config.yaml file.)
        ``ray_tracing_reflection_case`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | Only relevant for bottom reflections. 1: rays start upwards, 2: rays start downwards
        ``ray_tracing_solution_type`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | The type of the ray tracing solution. 0: direct, 1: refracted, 2: reflected (off the surface)  (A refracted ray is defined as a ray that has a turning point, i.e. if it transitions from upward going to downward going; a reflected ray is defined if it has a surface reflection.)
        ``focusing_factor`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) |  The focusing factor calculated by the propagation module.
        ``launch_vectors`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``, ``3``) | 3D (Cartesian) coordinates of the launch vector of each ray tracing solution, per shower and channel.
        ``receive_vectors`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``, ``3``) | 3D (Cartesian) coordinates of the receive vector of each ray tracing solution, per shower and channel.
        ``time_shower_and_ray`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | The "signal time" per shower and raytracing solution. I.e., the time of the signal arriving at the DAQ including, e.g., cable delay, ...
        ``travel_distances`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | The distance travelled by each ray tracing solution to a specific channel
        ``travel_times`` | (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``) | The time travelled by each ray tracing solution to a specific channel
        ``triggered`` | (``m_showers``) | Whether each shower contributed to an event that satisfied any trigger condition
        ``triggered_per_event`` | (``m_events``) | Whether each event fulfilled any trigger condition.
        ``trigger_times`` | (``m_showers``, ``n_triggers``) | The trigger times for each shower and trigger. IMPORTANT: A shower can potentially generate multiple events. Then this field is ambiguous, as only a single trigger time per shower can be saved. In that case, the latest trigger time is saved into this field.
        ``trigger_times_per_event`` | (``m_events``, ``n_triggers``) | The trigger times per event.
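
Several station datasets share the shape (``m_showers``, ``n_channels``, ``n_ray_tracing_solutions``), so they can be
indexed consistently. The sketch below picks, for each shower and channel, the ray tracing solution with the largest
amplitude and returns the matching travel time. It is a hedged example with toy arrays: the helper name is
hypothetical, and treating missing solutions as NaN is an assumption that should be checked against your own files.

.. code-block:: Python

    import numpy as np

    def strongest_ray(max_amp_shower_and_ray, travel_times):
        """Pick, per shower and channel, the ray tracing solution with the largest amplitude."""
        amp = np.abs(np.asarray(max_amp_shower_and_ray, dtype=float))
        amp = np.where(np.isnan(amp), -np.inf, amp)  # assumption: missing solutions are NaN
        best = np.argmax(amp, axis=-1)               # shape (m_showers, n_channels)
        times = np.take_along_axis(
            np.asarray(travel_times, dtype=float), best[..., None], axis=-1
        )
        return best, times[..., 0]

    # Toy arrays with shape (m_showers=1, n_channels=2, n_ray_tracing_solutions=2)
    amps = [[[0.2, 0.5], [0.9, np.nan]]]
    times = [[[1500.0, 1620.0], [1480.0, np.nan]]]
    best, best_times = strongest_ray(amps, times)

The same index array ``best`` can be reused with ``np.take_along_axis`` on other datasets of the same shape, such as ``travel_distances``.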