Reference Manual¶

NetMD is a computational method that synchronizes MD trajectories using graph-embedding and dynamic time-warping techniques. This allows researchers to better understand complex molecular interactions and can reveal previously unidentified patterns.

This page provides a more detailed manual for NetMD.

Usage and Help¶

NetMD is a command-line tool that allows users to analyze molecular dynamics (MD) trajectories by synchronizing them using graph-embedding and dynamic time-warping techniques. The program is designed to work with contact files generated from MD simulations, and it provides a comprehensive set of features for analyzing and visualizing the data.

To start the program you can use the following command:

(env) $ netmd -I INPUTPATH INPUTPATH -F FILES [FILES ...] -f FEATURES -e EDGEFILTER -c CONFIGFILE -o OUTPUTPATH -p {svg,png} --verbose

Where the command line arguments are defined as:

-h, --help

Show the help message and exit.
-F FILES [FILES …], --Files FILES [FILES …] (required)

Specify one or more contact file paths to be loaded.
-I INPUTPATH INPUTPATH, --InputPath INPUTPATH INPUTPATH (required)

This option accepts two inputs: a directory tree for recursive exploration and a filename prefix shared by all contact files. The program will search for contact files with the specified prefix within the provided directory tree.

Example:
```
-I ./examples/data FullReplica
```
-f FEATURES, --features FEATURES (optional)

Specify the path to the file containing node features. The file must be in tab-separated values (.tsv) format: the first column should be the residue number, followed by the feature columns. If no path is provided, the unique residue identifier in the contact file will be used as the node feature.

Example:

| residue_number | feature_1 | feature_2 |
| --- | --- | --- |
| 1 | 2.653 | NZ |
| 2 | 3.851 | NZ |
| 3 | 2.999 | OE1 |
| 4 | 3.729 | OD1 |
| 5 | 2.813 | OE1 |
| 6 | 3.654 | OE1 |
| … | … | … |

-e EDGEFILTER, --edgeFilter EDGEFILTER (optional)

Specify the entropy threshold used to filter the graph edges. (default: 0.1)
-c CONFIGFILE, --configFile CONFIGFILE (optional)

Specify the path to the configuration file containing arguments for Graph2Vec. If no path is provided, default values will be used.
-o OUTPUTPATH, --outputPath OUTPUTPATH (optional)

Specify the output path. If no path is provided, a results folder will be created.
-p, --plotFormat {svg,png} (optional)

Specify the format of the image output. (svg, png; default: svg)
--verbose (optional)

Print additional messages to monitor progress.
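
The node feature file described for `-f` can be read with the Python standard library. The following is a minimal sketch, not NetMD's own loader: the function name `read_node_features` is hypothetical, and it assumes a header row followed by one row per residue, with the residue number in the first column.

```python
import csv
import io


def read_node_features(handle):
    """Map residue number -> {feature name: raw string value} from a TSV stream."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)          # e.g. residue_number, feature_1, feature_2
    feature_names = header[1:]     # every column after the first is a feature
    features = {}
    for row in reader:
        if not row:                # skip blank lines
            continue
        features[int(row[0])] = dict(zip(feature_names, row[1:]))
    return features


# In-memory stand-in for a features file with the layout shown above.
example = io.StringIO(
    "residue_number\tfeature_1\tfeature_2\n"
    "1\t2.653\tNZ\n"
    "2\t3.851\tNZ\n"
    "3\t2.999\tOE1\n"
)
features = read_node_features(example)
print(features[3])
```

Values are kept as strings here because a feature column may be numeric or categorical, as in the example table.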


Program Workflow¶

1. Loading Molecular Dynamics (MD) Replica Files and Network Construction:

The process begins by loading multiple MD simulation trajectory files, each representing a “replica” of the molecular system’s evolution. There are two ways to pass the replica files to the program:

iterate_replica_files: Iterate through the given paths of the contact files.
crawl_replica_files: Recursively explore the given directory tree and search for contact files with a common prefix.

For each frame within each replica, a network representation is constructed using NetworkX, where nodes correspond to atoms or residues, and edges represent interactions between them. This network captures the structural relationships at each time point.
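
The recursive, prefix-based search described for `crawl_replica_files` can be illustrated with `pathlib`. This is a sketch of the idea only: `find_contact_files` is a hypothetical helper, and the `.tsv` extension and file names in the demo are assumptions made for the example.

```python
from pathlib import Path
import tempfile


def find_contact_files(root, prefix):
    """Return sorted paths of all files under root whose name starts with prefix."""
    return sorted(p for p in Path(root).rglob(prefix + "*") if p.is_file())


# Build a throwaway directory tree to show the behaviour.
with tempfile.TemporaryDirectory() as tmp:
    for rel in (
        "rep1/FullReplica_1.tsv",
        "rep2/nested/FullReplica_2.tsv",
        "rep2/other.tsv",              # does not share the prefix, so it is ignored
    ):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    found = [p.name for p in find_contact_files(tmp, "FullReplica")]

print(found)
```

This mirrors the two inputs of `-I`: a directory tree to explore and a filename prefix shared by all contact files.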

2. Preprocessing: Edge Filtering Based on Entropy:

To reduce noise and focus on significant interactions, each replica undergoes preprocessing. During the loading process, the load_data function is called; it computes the intra-replica entropy for each edge, which reflects the variability of that interaction within the individual replica’s trajectory. An inter-replica filter is then applied in entropy_filter, removing edges with consistently low entropy across all replicas. This step ensures that only robust and relevant interactions are retained for further analysis.
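
To make the two-stage filter concrete, the sketch below treats each edge as a binary variable (present or absent in a frame) and uses the binary Shannon entropy of its presence frequency. That definition is an assumption for illustration, since the exact entropy used by NetMD is not stated here, and both function names are hypothetical.

```python
import math
from collections import Counter


def edge_entropy(frames):
    """Binary Shannon entropy (in bits) of each edge's presence across frames.

    frames: list of sets of edges, one set per frame of a single replica.
    """
    n = len(frames)
    counts = Counter(edge for frame in frames for edge in frame)
    entropy = {}
    for edge, k in counts.items():
        p = k / n
        if p == 1.0:
            entropy[edge] = 0.0    # always present: no variability
        else:
            entropy[edge] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return entropy


def low_entropy_everywhere(per_replica_entropy, threshold=0.1):
    """Edges whose entropy is below the threshold in every replica."""
    all_edges = set().union(*per_replica_entropy)
    return {
        edge
        for edge in all_edges
        if all(ent.get(edge, 0.0) < threshold for ent in per_replica_entropy)
    }


replica_a = [{("A", "B"), ("B", "C")}, {("A", "B")}, {("A", "B"), ("B", "C")}, {("A", "B")}]
replica_b = [{("A", "B")}, {("A", "B")}, {("A", "B"), ("B", "C")}, {("A", "B")}]

ent_a = edge_entropy(replica_a)
ent_b = edge_entropy(replica_b)
dropped = low_entropy_everywhere([ent_a, ent_b], threshold=0.1)
print(dropped)
```

In this toy input the A-B contact is present in every frame of both replicas, so it carries no variability and is the only edge filtered out; the default threshold of 0.1 matches the documented `-e` default.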

3. Graph2Vec Embedding: Transforming Networks into Time Series:

Following preprocessing, network representations are fed into g2v_fit_transform, leveraging Graph2Vec[1] to produce numerical embeddings. This embedding process translates the structural essence of each frame into a high-dimensional vector. Consequently, the molecular system’s dynamic evolution is captured as a sequential series of these embedding vectors.
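
Graph2Vec is commonly described as two stages: Weisfeiler-Lehman (WL) relabeling turns each graph into a bag of subtree features, and a doc2vec-style model then learns one vector per graph from those bags. The sketch below covers only the first stage, in plain Python, to show why structurally identical frames yield identical features. It is not the code behind g2v_fit_transform, and `wl_features` is a hypothetical name.

```python
import hashlib


def wl_features(adjacency, labels, iterations=2):
    """Collect Weisfeiler-Lehman subtree features for one graph.

    adjacency: dict node -> list of neighbouring nodes
    labels:    dict node -> initial node label (e.g. a residue feature)
    """
    current = {node: str(labels[node]) for node in adjacency}
    features = list(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # A node's new label summarises its own label and its neighbours' labels.
            signature = current[node] + "|" + ",".join(sorted(current[n] for n in neighbours))
            updated[node] = hashlib.sha256(signature.encode()).hexdigest()
        current = updated
        features.extend(current.values())
    return features


# Two frames with the same contact pattern but different node identifiers.
frame_1 = {1: [2], 2: [1, 3], 3: [2]}
frame_2 = {7: [8], 8: [7, 9], 9: [8]}
# A frame in which one contact has been lost.
frame_3 = {1: [2], 2: [1], 3: []}

f1 = wl_features(frame_1, {1: "A", 2: "B", 3: "A"})
f2 = wl_features(frame_2, {7: "A", 8: "B", 9: "A"})
f3 = wl_features(frame_3, {1: "A", 2: "B", 3: "A"})
print(sorted(f1) == sorted(f2), sorted(f1) == sorted(f3))
```

Frames that share many such features end up close together in the embedding space, which is what lets a replica be read as a time series of vectors.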

4. Barycenter Calculation: Identifying the Central Trajectory:

To establish a reference trajectory, the barycenter of the embedded time series is computed. The barycenter represents the average or central trajectory, capturing the common dynamic features across all replicas. This serves as a basis for comparing and ranking the individual replicas.

5. Ranking Time Series: Assessing Similarity to the Barycenter:

Each embedded time series (representing a replica) is ranked based on its similarity to the calculated barycenter. This ranking is typically determined using the Dynamic Time Warping (DTW) distance, which accounts for temporal misalignments between trajectories. The replicas with the lowest DTW distance to the barycenter are considered the most similar and are therefore ranked higher.
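
Steps 4 and 5 can be sketched together in plain Python. The barycenter here is computed with DTW barycenter averaging (DBA), a common choice that is assumed for illustration; the text above does not specify which barycenter algorithm NetMD uses. All function names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def dtw_path(a, b):
    """Return (DTW distance, alignment path) between two sequences of vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which frames were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def dba_barycenter(series, iterations=5):
    """DTW barycenter averaging, initialised with the first series."""
    bary = [list(point) for point in series[0]]
    for _ in range(iterations):
        buckets = [[] for _ in bary]
        for s in series:
            _, path = dtw_path(bary, s)
            for bary_idx, s_idx in path:
                buckets[bary_idx].append(s[s_idx])
        # Each barycenter frame becomes the mean of the frames aligned to it.
        bary = [[sum(c) / len(pts) for c in zip(*pts)] for pts in buckets]
    return bary


def rank_by_dtw(series_by_name, barycenter):
    """Sort replicas by increasing DTW distance to the barycenter."""
    distances = {name: dtw_path(barycenter, s)[0] for name, s in series_by_name.items()}
    return sorted(distances.items(), key=lambda item: item[1])


replicas = {
    "rep_a": [[0.0], [1.0], [2.0], [3.0]],
    "rep_b": [[0.0], [0.0], [1.0], [2.0], [3.0]],  # same trend, delayed by one frame
    "rep_c": [[5.0], [5.0], [5.0], [5.0]],         # behaves differently
}
barycenter = dba_barycenter(list(replicas.values()))
ranking = rank_by_dtw(replicas, barycenter)
for name, distance in ranking:
    print(name, round(distance, 3))
```

Because DTW absorbs the one-frame delay of `rep_b`, it stays close to `rep_a`, while the flat `rep_c` ends up last in the ranking.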

6. Hierarchical Clustering: Grouping Similar Trajectories:

Finally, hierarchical clustering is performed on the embedded time series to identify groups of replicas with similar dynamic behaviors. This clustering allows researchers to categorize the MD trajectories into distinct clusters, revealing common molecular patterns and identifying variations in the system’s dynamics. This step helps to visualize and understand the diversity of the MD simulations.
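
As an illustration of the clustering step, the sketch below runs a naive agglomerative clustering on a small pairwise distance matrix. Average linkage and a fixed distance threshold are assumptions made for this example; NetMD's linkage method is not stated here, and its dendrogram cut is computed with the elbow method rather than a fixed threshold.

```python
def average_linkage_clusters(dist, threshold):
    """Merge clusters until the closest pair is farther apart than threshold.

    dist: symmetric square matrix of pairwise distances (list of lists).
    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)


# Toy distance matrix for four replicas: 0 and 1 are close, 2 and 3 are close.
toy_matrix = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
groups = average_linkage_clusters(toy_matrix, threshold=3)
print(groups)
```

With this input the first two and the last two replicas form separate groups, because the average distance between the two groups is well above the threshold.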

Throughout the execution, several plots and files are saved in the out_path specified.

Plots Generated¶

2D Plot of the Replica Embeddings: Visual representation of the embeddings of all replicas.
2D Plot of the Replica Embeddings with Barycenter: The embeddings with the barycenter overlaid.
Iterative Pruning Process Plot: Visualization of the iterative pruning process used to refine the replicas.
Dendrogram Plot: A dendrogram plot with a cut line computed using the elbow method, showing the hierarchical clustering results.

Files Created¶

metadata.tsv: A DataFrame containing information about the replica and frame. Can be used to index the replica embeddings.
subgraphs_emb.pkl: A list of subgraph embeddings belonging to different replicas.
dtw_matrix.tsv: A square matrix where each value represents the Euclidean distance between a pair of aligned time series.
dtw_mapping.txt: This file contains the frame of each replica and the corresponding frame of the barycenter. It illustrates the dynamic time warping (DTW) indexing and the Euclidean distance between frames.
iterative_ranks.tsv: A file containing the iterative pruning results of all replicas based on their distance from the barycenter.

NetMD Reference¶

This reference manual provides a detailed overview of the functions and classes used in NetMD. It is intended to help users understand the functionality and inner workings of the program. The manual is organized into sections, each focusing on a specific function or class.


Data Loader¶

create_parser¶

check_entropy¶

iterate_replica_files¶

crawl_replica_files¶

load_data¶

compute_entropy¶

process_file¶

parse_config¶

Embedding¶

g2v_fit_transform¶

entropy_filter¶

dtw_mapping¶

iterative_pruning¶

dim_reduction¶

iterative_dim_reduction¶

Clustering¶

compute_dtwmatrix¶

normalized_dtw¶

hierarchical_clustering_rank¶

elbow_method_cut¶

map_link_to_data¶

Plot Utils¶

plot_emb_rep¶

plot_emb_bary¶

plot_pruning¶

plot_dendogram¶
