API for scripting

Scripting with tridesclous avoids unnecessary clicking through GUI parameter dialogs when you need to process many files. Almost everything can be done with 3 classes:

  • DataIO for configuration of the dataset, format selection, PRB file assignment, …
  • CatalogueConstructor runs all steps to construct the catalogue: signal processing, feature extraction, clustering, …
  • Peeler runs the template matching engine

Of course, everything done by script can still be checked and modified with the GUI (MainWindow, CatalogueWindow and PeelerWindow).

The best way to start is to read the examples in the git repo.
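
As a quick orientation, here is a minimal end-to-end sketch combining the 3 classes. Paths, parameters and the probe name are placeholders, and the final run() call is an assumption about the Peeler API; the detailed usage of each class is documented below.

from tridesclous import DataIO, CatalogueConstructor, Peeler

# configure the dataset once (working dir, data source, probe)
dataio = DataIO(dirname='/path/to/a/working/dir')
dataio.set_data_source(type='RawData', filenames=['file1.raw'],
                       sample_rate=10000., total_channel=16, dtype='int16')
dataio.download_probe('kampff_128', origin='spyking-circus')

# build the catalogue for one channel group
cc = CatalogueConstructor(dataio, chan_grp=0)
# ... set parameters, run the signal processor, cluster (see CatalogueConstructor below) ...
cc.make_catalogue_for_peeler()

# run the template matching engine (run() is assumed here, see the Peeler class below)
peeler = Peeler(dataio)
peeler.change_params(catalogue=dataio.load_catalogue(chan_grp=0))
peeler.run()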

Classes API

class tridesclous.dataio.DataIO(dirname='test')[source]

Class to access the dataset (raw data, processed signals, catalogue, spikes) in read/write mode.

All operations on the dataset are done through this class.

The DataIO:
  • works in a path; almost everything is persistent
  • is needed by CatalogueConstructor and Peeler
  • has a datasource member that accesses the raw data
  • stores/loads processed signals
  • stores/loads spikes
  • stores/loads the catalogue
  • deals with several channel groups given a PRB file
  • deals with several segments of recording (aka several files of raw data)
  • exports the results (labeled spikes) to different formats

The underlying data storage is a simple tree on the file system. Everything is organised as simply as possible in sub folders (by channel group, then segment index). In each folder:

  • arrays.json describes the list of numpy arrays (name, dtype, shape)
  • XXX.raw files are the raw numpy arrays, loaded with a simple memmap.
  • some arrays are structured arrays (aka arrays of struct)

The datasource system is based on neo.rawio, so all formats in neo.rawio are available in tridesclous. neo.rawio is able to read chunks of signals indexed on the time axis and channel axis.

The raw dataset does not need to be inside the working directory; it can be somewhere outside. The info.json describes the link to the datasource (raw data).

Many raw datasets are saved by the device with an underlying int16 dtype. DataIO saves the processed signals as float32 by default, which is twice the size. So if you have a 10 GB raw dataset, tridesclous will need at least 20 GB more for storage of the processed signals.

Usage:

# initialize a directory
dataio = DataIO(dirname='/path/to/a/working/dir')

# set a data source
filenames = ['file1.raw', 'file2.raw']
dataio.set_data_source(type='RawData', filenames=filenames,
                       sample_rate=10000, total_channel=16, dtype='int16')

# set a PRB file
dataio.set_probe_file('/path/to/a/file.prb')
# or download it
dataio.download_probe('kampff_128', origin='spyking-circus')

# check length and channel groups
print(dataio)
already_processed(seg_num=0, chan_grp=0, length=None)[source]

Check whether the segment is already processed up to length.

append_spikes(seg_num=0, chan_grp=0, spikes=None)[source]

Append spikes.

channel_group_label(chan_grp=0)[source]

Label of channel for a group.

download_probe(probe_name, origin='kwikteam')[source]

Download a prb file from github into the working dir.

The spyking-circus and kwikteam projects provide lists of PRB files on github.

probe_name: str
the name of the file on github
origin: 'kwikteam' or 'spyking-circus'
the github project
export_spikes(export_path=None, split_by_cluster=False, use_cell_label=True, formats=None)[source]

Export spikes to other formats (csv, matlab, excel, …).

export_path: str or None
export path. If None (default), export inside the working dir.
split_by_cluster: bool (default False)
Whether each cluster is exported to a different file.
use_cell_label: bool (default True)
If True, cell_label is used; if False, cluster_label is used.
formats: 'csv' or 'mat' or 'xlsx'
The output format.
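
For example, a minimal sketch exporting each cluster to a separate CSV file inside the working dir (this assumes the Peeler has already been run so that labeled spikes exist):

dataio.export_spikes(export_path=None, split_by_cluster=True,
                     use_cell_label=True, formats='csv')
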
flush_processed_signals(seg_num=0, chan_grp=0, processed_length=-1)[source]

Flush the underlying memmap for processed signals.

flush_spikes(seg_num=0, chan_grp=0)[source]

Flush underlying memmap for spikes.

get_geometry(chan_grp=0)[source]

Get the geometry for a given channel group as a numpy array.

get_peak_values(seg_num=0, chan_grp=0, sample_indexes=None, channel_indexes=None)[source]

Extract peak values

get_processed_length(seg_num=0, chan_grp=0)[source]

Get the length (in samples) of the already processed part of the segment.

get_segment_length(seg_num)[source]

Segment length (in samples) for a given segment index.

get_segment_shape(seg_num, chan_grp=0)[source]

Segment shape for a given segment index and channel group.

get_signals_chunk(seg_num=0, chan_grp=0, i_start=None, i_stop=None, signal_type='initial', pad_width=0)[source]

Get a chunk of signal for a given segment index and channel group.

The signal can be the 'initial' signal (aka the raw, unfiltered signal) or the 'processed' signal.

seg_num: int
segment index
chan_grp: int
channel group key
i_start: int or None
start index (included)
i_stop: int or None
stop index (not included)
signal_type: str
‘initial’ or ‘processed’
pad_width: int (0 default)
Add an optional pad on each side, useful to avoid filter border effects.
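
A small sketch reading one second of processed signal for segment 0 of channel group 0 (the sample rate value is taken from the usage example above and is an assumption for your dataset):

sr = 10000.  # sample rate of the usage example above
sigs = dataio.get_signals_chunk(seg_num=0, chan_grp=0,
                                i_start=0, i_stop=int(1. * sr),
                                signal_type='processed')
print(sigs.shape)  # (nb_sample, nb_channel)
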
get_some_waveforms(seg_num=None, seg_nums=None, chan_grp=0, peak_sample_indexes=None, n_left=None, n_right=None, waveforms=None, channel_indexes=None)[source]

Extract some waveforms given peak_sample_indexes.

If seg_num is an int then all spikes come from the same segment.

If seg_num is None then seg_nums is an array that contains the seg_num for each spike.
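
A hedged sketch extracting waveforms around the first 100 spikes of segment 0 (this assumes spikes have already been detected and saved, and that the spikes structured array exposes an 'index' field like all_peaks described below):

spikes = dataio.get_spikes(seg_num=0, chan_grp=0)
wfs = dataio.get_some_waveforms(seg_num=0, chan_grp=0,
                                peak_sample_indexes=spikes['index'][:100],
                                n_left=-25, n_right=40)
print(wfs.shape)  # expected (100, width, nb_channel)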

get_spikes(seg_num=0, chan_grp=0, i_start=None, i_stop=None)[source]

Read spikes

iter_over_chunk(seg_num=0, chan_grp=0, i_stop=None, chunksize=1024, pad_width=0, with_last_chunk=False, **kargs)[source]

Create an iterator over signals ('initial' or 'processed').

for ind, sig_chunk in dataio.iter_over_chunk(seg_num=0, chan_grp=0, chunksize=1024, signal_type='processed'):
    do_something_on_chunk(sig_chunk)
load_catalogue(name='initial', chan_grp=0)[source]

Load the catalogue dict.

nb_channel(chan_grp=0)[source]

Number of channels for a channel group.

reset_processed_signals(seg_num=0, chan_grp=0, dtype='float32')[source]

Reset processed signals.

reset_spikes(seg_num=0, chan_grp=0, dtype=None)[source]

Reset spikes.

save_catalogue(catalogue, name='initial')[source]

Save the catalogue made by CatalogueConstructor and needed by Peeler inside the working dir.

Note that you can construct several catalogues for the same dataset in order to compare them: just change the name, and each catalogue is stored in a different folder.
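
For example, a sketch comparing two catalogues built with different parameters (the name 'variant' is purely illustrative; make_catalogue_for_peeler, documented below, saves the catalogue in the working dir):

cc.make_catalogue_for_peeler(catalogue_name='initial')
# ... change some clustering parameters and re-run ...
cc.make_catalogue_for_peeler(catalogue_name='variant')

cat_a = dataio.load_catalogue(name='initial', chan_grp=0)
cat_b = dataio.load_catalogue(name='variant', chan_grp=0)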

set_channel_groups(channel_groups, probe_filename='channels.prb')[source]

Manually set the channel groups dictionary.

set_data_source(type='RawData', **kargs)[source]

Set the datasource. This must be done only once, otherwise an error is raised.

type: str (‘RawData’, ‘Blackrock’, ‘Neuralynx’, …)
The name of the neo.rawio class used to open the dataset.
kargs:
depend on the class used. They are transmitted to the neo class, so see the neo documentation for the kargs.
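
For instance, a hedged sketch opening a Blackrock recording. The filename is a placeholder, and passing filenames (by analogy with the 'RawData' example above) is an assumption: the exact kargs are defined by the corresponding neo class, so check the neo documentation.

dataio.set_data_source(type='Blackrock', filenames=['myfile.ns5'])
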
set_probe_file(src_probe_filename)[source]

Set the probe file. The probe file is copied inside the working dir.

set_signals_chunk(sigs_chunk, seg_num=0, chan_grp=0, i_start=None, i_stop=None, signal_type='processed')[source]

Set a signal chunk (only for ‘processed’)

class tridesclous.catalogueconstructor.CatalogueConstructor(dataio, chan_grp=None, name='catalogue_constructor')[source]

The goal of CatalogueConstructor is to construct a catalogue of templates (centroids) for the Peeler.

To do so, the CatalogueConstructor will:
  • preprocess a duration of data
  • detect peaks
  • extract some waveforms
  • compute some features from these waveforms
  • find clusters
  • compute some metrics on these clusters
  • enable some manual trash/merge/split

At the end we can call make_catalogue_for_peeler, which constructs everything useful for the Peeler: the centroids (the median of each cluster), their first and second derivatives, the median and MAD of the noise, …

You need to have one CatalogueConstructor for each channel_group.

Since all operations are more or less slow, each internal array is by default persistent on disk (memory_mode='memmap'). This way, when you instantiate a CatalogueConstructor you load all existing arrays already computed. For that, tridesclous mainly uses numpy structured arrays (arrays of struct) with memmap. So hacking the internal arrays of the CatalogueConstructor should be easy, even outside Python and tridesclous.

You can explore/save/reload the internal state by browsing this directory: path_of_dataset/channel_group_XXX/catalogue_constructor
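
Because the storage is plain numpy memmaps described by arrays.json (see DataIO above), internal arrays can in principle be read back without tridesclous. A hedged sketch, assuming the folder layout described above, that arrays.json maps each array name to its dtype and shape, and that the file is named after the array:

import json, os
import numpy as np

# assumption: catalogue_constructor subdir for channel group 0
cc_dir = '/path/to/dataset/channel_group_0/catalogue_constructor'
with open(os.path.join(cc_dir, 'arrays.json')) as f:
    arrays_info = json.load(f)  # assumed to map array name -> dtype/shape

info = arrays_info['some_waveforms']
some_waveforms = np.memmap(os.path.join(cc_dir, 'some_waveforms.raw'), mode='r',
                           dtype=info['dtype'], shape=tuple(info['shape']))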

Usage:

from tridesclous import DataIO, CatalogueConstructor
dirname = '/path/to/dataset'
dataio = DataIO(dirname=dirname)
cc = CatalogueConstructor(dataio, chan_grp=0)

# preprocessing
cc.set_preprocessor_params(chunksize=1024,
    highpass_freq=300.,
    lowpass_freq=5000.,
    common_ref_removal=True,
    pad_width=64,
    peak_sign='-', 
    relative_threshold=5.5,
    peak_span_ms=0.3,
    )
cc.estimate_signals_noise(seg_num=0, duration=10.)
cc.run_signalprocessor(duration=60.)

# waveform/feature/cluster
cc.extract_some_waveforms(n_left=-25, n_right=40, mode='rand')
cc.clean_waveforms(alien_value_threshold=55.)
cc.project(method='global_pca', n_components=7)
cc.find_clusters(method='kmeans', n_clusters=7)

# manual stuff
cc.trash_small_cluster(n=5)
cc.order_clusters()

# do this before peeler
cc.make_catalogue_for_peeler()

For more complete scripts, see the examples in the git repo.

Persistent attributes:

CatalogueConstructor has a mechanism to store/load some attributes in the dirname of the DataIO. This is useful to continue previous work on a catalogue.

The attributes are almost all numpy arrays stored using the numpy.memmap mechanism, so prefer fast local storage like an SSD.

Here is the list of these attributes with shape and dtype (a short inspection sketch follows this list). N is the total number of detected peaks. M is the number of peaks selected for waveform/feature/cluster. C is the number of clusters.

  • all_peaks (N, ) dtype = [(‘index’, ‘int64’), (‘cluster_label’, ‘int64’), (‘channel’, ‘int64’), (‘segment’, ‘int64’), (‘extremum_amplitude’, ‘float64’)]
  • signals_medians (nb_sample, nb_channel, ) float32
  • signals_mads (nb_sample, nb_channel, ) float32
  • clusters (C, ) dtype = [(‘cluster_label’, ‘int64’), (‘cell_label’, ‘int64’), (‘extremum_channel’, ‘int64’), (‘extremum_amplitude’, ‘float64’), (‘waveform_rms’, ‘float64’), (‘nb_peak’, ‘int64’), (‘tag’, ‘U16’), (‘annotations’, ‘U32’), (‘color’, ‘uint32’)]
  • some_peaks_index (M) int64
  • some_waveforms (M, width, nb_channel) float32
  • some_features (M, nb_feature) float32
  • channel_to_features (nb_chan, nb_component) bool
  • some_noise_snippet (nb_noise, width, nb_channel) float32
  • some_noise_index (nb_noise, ) int64
  • some_noise_features (nb_noise, nb_feature) float32
  • centroids_median (C, width, nb_channel) float32
  • centroids_mad (C, width, nb_channel) float32
  • centroids_mean (C, width, nb_channel) float32
  • centroids_std (C, width, nb_channel) float32
  • spike_waveforms_similarity (M, M) float32
  • cluster_similarity (C, C) float32
  • cluster_ratio_similarity (C, C) float32
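
A short sketch inspecting some of these persistent attributes after clustering (a minimal illustration; the shapes follow the list above and assume the corresponding steps have been run):

print(cc.all_peaks.shape, cc.all_peaks.dtype.names)   # (N,) structured array
print(cc.some_waveforms.shape)                        # (M, width, nb_channel)
print(cc.centroids_median.shape)                      # (C, width, nb_channel)
print(cc.clusters['cluster_label'], cc.clusters['nb_peak'])
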
apply_all_steps(params, verbose=True)[source]
auto_merge_high_similarity(threshold=0.95)[source]

Recursively merge all pairs with similarity higher than a given threshold.

clean_peaks(alien_value_threshold=200, mode='extremum_amplitude')[source]

Detect alien peaks with very high values. Tag them as alien (label=-9), which prevents them from being selected. Useful to remove artifacts.

alien_value_threshold: float or None
threshold for alien peaks
mode: 'peak_value' or 'full_waveform'
'peak_value' uses only the peak value (very fast, because it is already measured); 'full_waveform' uses the full waveform and can be slow.
create_savepoint(name=None)[source]

This creates a copy of the entire catalogue_constructor subdir. Useful for the UI when the user wants a snapshot before trying a tricky merge/split.

estimate_signals_noise(seg_num=0, duration=10.0)[source]

This estimates the median and MAD of the processed signals over a short duration. This is necessary for normalisation in the next steps.

Note that if the noise is stable, even a short duration is OK.

seg_num: int
segment index
duration: float
duration in seconds
extract_some_features(method='global_pca', selection=None, **params)[source]

Extract features from waveforms.

If selection is None then all peaks from some_peaks_index are taken.

Otherwise, selection is a boolean mask of size all_peaks used for the fit; the transform is then applied to all peaks.
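
For instance, a hedged sketch fitting the projection only on the first half of the detected peaks (the mask is purely illustrative; n_components is assumed to be a valid parameter of the 'global_pca' method, as in the usage example above):

import numpy as np

sel = np.zeros(cc.all_peaks.size, dtype='bool')
sel[:cc.all_peaks.size // 2] = True
cc.extract_some_features(method='global_pca', n_components=5, selection=sel)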

extract_some_noise(nb_snippet=300)[source]

Find some snippets of signal that do not overlap with peak waveforms.

Useful to project this noise with the same transform as the real waveforms and see the distinction between waveforms and noise in the subspace.

find_clusters(method='kmeans', selection=None, order=True, recompute_centroid=True, **kargs)[source]

Find clusters for peaks that have a waveform and features.

selection is a boolean mask of size all_peaks.

flush_info()[source]

Flush info (mainly parameters) to json files.

get_some_waveforms(peaks_index=None, channel_indexes=None, n_left=None, n_right=None)[source]

Get waveforms across segments based on the internal self.some_peaks_index or on the forced peaks_index.

make_catalogue_for_peeler(catalogue_name='initial', **kargs)[source]

Make and save catalogue in the working dir for the Peeler.

order_clusters(by='waveforms_rms')[source]

This reorders labels from highest RMS to lowest RMS: the higher the RMS, the smaller the label. Negative labels are not reassigned.

re_detect_peak(**kargs)[source]

Peaks are detected during run_signalprocessor. But in some cases, for testing other thresholds, peaks can be re-detected without re-running the signal processing.

method : ‘global’ or ‘geometrical’
Method for detection.
engine: ‘numpy’ or ‘opencl’ or ‘numba’
Engine for peak detection.
peak_sign: ‘-’ or ‘+’
Sign of the peak.
relative_threshold: int default 7
Threshold for peak detection. The preprocessed signal has units expressed in MAD (robust STD), so 7 means 7*MAD.
peak_span_ms: float default 0.3
Peak span to avoid double detection. In milliseconds.
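
For example, a sketch re-detecting peaks with a lower threshold without re-running the signal processing (the values are illustrative):

cc.re_detect_peak(method='global', engine='numpy', peak_sign='-',
                  relative_threshold=6, peak_span_ms=0.5)
print(cc.all_peaks.size, 'peaks detected')
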
reload_data()[source]

Reload persistent arrays.

run_signalprocessor_OLD(duration=60.0, detect_peak=True)[source]

This runs the signal preprocessing chain, chunk by chunk, on all segments.

The duration can be clipped for very long recordings. For catalogue construction, the user must have an intuition of how much signal is needed to get enough spikes to detect the clusters. If the duration is too short, clusters will not be visible. If it is too long, the processing time will be unacceptable. This totally depends on the dataset (number of channels, spike rate, …).

This also detects peaks, to avoid useless extra access to the storage.

duration: float
total duration in seconds over all segments
detect_peak: bool (default True)
Also detect peaks.
set_global_params(chunksize=1024, memory_mode='memmap', internal_dtype='float32', mode='dense', sparse_threshold=1.5, n_spike_for_centroid=350, n_jobs=-1)[source]
chunksize: int. default 1024
Length of each chunk for processing. For real time, the latency is between chunksize and 2*chunksize.
memory_mode: ‘memmap’ or ‘ram’ default ‘memmap’
By default all arrays are persistent with memmap but you can also have everything in ram.
internal_dtype: 'float32' (or 'float64')
Internal dtype for signals/waveforms/features. Support for integer signals and waveforms is planned and should boost the processing.
mode: str, 'dense' or 'sparse'
Choose the global mode, dense or sparse.
adjacency_radius_um: float or None
When mode='sparse' this must not be None.
set_peak_detector_params(method='global', engine='numpy', peak_sign='-', relative_threshold=7, peak_span_ms=0.3, adjacency_radius_um=None, smooth_radius_um=None)[source]

Set parameters for the peak_detector engine

method : ‘global’ or ‘geometrical’
Method for detection.
engine: ‘numpy’ or ‘opencl’ or ‘numba’
Engine for peak detection.
peak_sign: ‘-’ or ‘+’
Sign of the peak.
relative_threshold: int default 7
Threshold for peak detection. The preprocessed signal has units expressed in MAD (robust STD), so 7 means 7*MAD.
peak_span_ms: float default 0.3
Peak span to avoid double detection. In milliseconds.
set_preprocessor_params(engine='numpy', highpass_freq=300.0, lowpass_freq=None, smooth_size=0, common_ref_removal=False, pad_width=None)[source]

Set parameters for the preprocessor engine

engine: 'numpy' or 'opencl'
If you have pyopencl and a correct ICD installed, you can try 'opencl': for high channel counts, some critical parts of the processing are done on the GPU.
highpass_freq: float default 300
High pass cutoff frequency of the filter. Can be None if the raw dataset is already filtered.
lowpass_freq: float default is None
Low pass frequency of the filter. None by default. For high sampling rates, a low pass at 5~8 kHz can be a good idea to smooth the waveforms a bit and avoid high frequency noise.
smooth_size: int default 0
This is a simple smoothing convolution kernel. It more or less acts like a low pass filter. Can be used instead of lowpass_freq.
common_ref_removal: bool. False by default.
This removes the median across all channels, sample by sample.
pad_width: int. default None
Size in samples of the margin at the front edge of each chunk to avoid border effects in the backward filter. If you don't know, put None; then pad_width will be int(sample_rate/highpass_freq)*3, which is quite robust (<5% error) compared to a true offline filtfilt.
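
For example, a sketch with an explicit pad_width following the default rule described above (the sample rate value is an assumption for your dataset):

sample_rate = 10000.
highpass_freq = 300.
pad_width = int(sample_rate / highpass_freq) * 3   # the default rule, ~100 samples here
cc.set_preprocessor_params(engine='numpy', highpass_freq=highpass_freq,
                           lowpass_freq=5000., common_ref_removal=False,
                           pad_width=pad_width)
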
split_cluster(label, method='kmeans', **kargs)[source]

This splits one cluster by applying a new clustering method to that subset only.

tag_same_cell(labels_to_group)[source]

In some situations, during spike bursts, the amplitude changes abruptly. In these cases we can get 2 distinct clusters although the user suspects they come from the same cell. The user must not merge the 2 clusters in such cases, because the merged centroid would represent nothing and the Peeler would fail.

Instead we tag the 2 different clusters as "same cell", i.e. they share the same cell_label.

This is a manual action.
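
For example (the cluster labels are illustrative):

# clusters 3 and 7 are suspected to come from the same bursting cell
cc.tag_same_cell([3, 7])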

class tridesclous.peeler.Peeler(dataio)[source]

The Peeler is the core of the spike sorting itself. It basically does template matching on the signals.

This class needs a catalogue constructed by CatalogueConstructor. The computation is then applied chunk by chunk on the raw signal itself.

So this class is the same for both offline and online computing.

For each chunk, the algorithm is basically:
  1. Apply the processing chain (filter, normalize, …).
  2. Detect peaks.
  3. Try to classify the peaks and estimate their jitter.
  4. From the labeled peaks, create a prediction for the chunk.
  5. Subtract the prediction from the processed signals.
  6. Go back to 2 until there are no peaks left, or only peaks that cannot be labeled.
  7. Return the labeled spikes from this or previous chunks and the processed signals (for display or recording).

The main difficulty in the implementation is dealing with edges, because spike waveforms can spread across 2 chunks.

Note that the global latency depends on these parameters (a usage sketch follows this list):
  • pad_width
  • chunksize
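
A minimal usage sketch (the run() call is an assumption about the Peeler API; the other parameters follow change_params below and the DataIO.load_catalogue method above):

from tridesclous import Peeler

catalogue = dataio.load_catalogue(name='initial', chan_grp=0)
peeler = Peeler(dataio)
peeler.change_params(catalogue=catalogue, engine='geometrical', chunksize=1024)
peeler.run()
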
change_params(catalogue=None, engine='geometrical', internal_dtype='float32', chunksize=1024, speed_test_mode=False, **params)[source]
speed_test_mode: bool
only for offline mode; create a log file with the run time for each buffer
get_run_times(chan_grp=0, seg_num=0)[source]

Needs speed_test_mode=True in change_params.