Bioinformatics on Bytes of Life

DeepChopper: A Genomic Language Model that Cleans Up Nanopore Direct RNA Sequencing

Thu, 07 May 2026 00:00:00 +0000

The chimera mystery

Direct RNA sequencing (dRNA-seq) on Oxford Nanopore looks, on paper, like a transcriptomics dream. You sequence native RNA molecules end to end, you keep the modifications, and you skip every reverse-transcription and PCR step that has been quietly polluting short-read data for years. For a while, that was the story we were telling ourselves.

A language model enables accurate structural variant detection in whole-genome amplified long-read sequencing

Thu, 23 Jan 2025 00:00:00 +0000

Genomic Language Model Mitigates Chimera Artifacts in Nanopore Direct RNA Sequencing

Thu, 31 Oct 2024 00:00:00 +0000

Aurora Is a Web Application for Visualizing Non-linear Graph

Thu, 04 Apr 2024 00:00:00 +0000

PxBLAT: An Efficient and Ergonomic Python Binding Library for BLAT

Sun, 25 Jun 2023 00:00:00 +0000

Efficient Genomic Interval Search Using SIMD-Enhanced COITree

Sun, 12 Mar 2023 00:00:00 +0000

Background

In bioinformatics, researchers routinely analyze many types of genomic data, such as DNA sequencing, RNA sequencing, and epigenetic data. Manipulating genomic intervals is central to understanding the genetic basis of disease and identifying potential therapeutic targets. A genomic interval is a region that spans from a start position to an end position; it can encompass genes, regulatory elements, and other functional elements of the genome. One primary application of genomic interval manipulation is the analysis of ChIP-seq data. Interval operations also allow ChIP-seq data to be integrated with other genomic data types, such as gene expression and genetic variation, which gives a more comprehensive view of biological processes and their contribution to normal development or disease.

However, integrating these data types into a single data structure is challenging, especially for large datasets. Cache Oblivious Interval Trees (COITree), with their cache-oblivious design and efficient query algorithms, have the potential to handle and integrate multiple types of genomic data in a single data structure. COITree stores intervals in contiguous memory and employs an in-order van Emde Boas layout to enhance query performance, reducing the number of cache misses during traversal. Even so, COITree still suffers from performance bottlenecks on large datasets. One way to address this is to use Single Instruction Multiple Data (SIMD) instructions, which apply one operation to several values at once. I therefore hypothesize that a SIMD-enhanced COITree is a viable way to improve the speed and efficiency of genomic interval analysis.
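To make the underlying operation concrete, here is a minimal sketch of the overlap query that an interval tree answers. It is not COITree's implementation: it is a pure-Python linear scan, and the function names, the closed-interval convention, and the `lanes` parameter are assumptions made for illustration. The chunked variant only mimics the shape of a SIMD loop, where one comparison is applied to several intervals at a time.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # closed interval [start, end]; a convention assumed here


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Two closed intervals overlap when each one starts before the other ends."""
    return a_start <= b_end and b_start <= a_end


def query_overlaps(intervals: List[Interval], q_start: int, q_end: int) -> List[Interval]:
    """Linear-scan baseline: test the query against every stored interval."""
    return [(s, e) for s, e in intervals if overlaps(s, e, q_start, q_end)]


def count_overlaps_chunked(
    starts: Sequence[int],
    ends: Sequence[int],
    q_start: int,
    q_end: int,
    lanes: int = 8,  # hypothetical lane count, standing in for a SIMD register width
) -> int:
    """Count overlaps over separate start/end arrays, one fixed-size chunk at a time.

    Each chunk applies the same branch-free comparison to every element,
    which is the pattern a SIMD instruction would execute in one step.
    """
    total = 0
    for i in range(0, len(starts), lanes):
        chunk_starts = starts[i:i + lanes]
        chunk_ends = ends[i:i + lanes]
        total += sum((s <= q_end) & (q_start <= e) for s, e in zip(chunk_starts, chunk_ends))
    return total


# Hypothetical ChIP-seq-like peaks on one chromosome.
peaks = [(100, 200), (150, 300), (400, 500), (550, 600)]
hits = query_overlaps(peaks, 180, 420)
count = count_overlaps_chunked([s for s, _ in peaks], [e for _, e in peaks], 180, 420, lanes=2)
```

Keeping starts and ends in separate contiguous arrays is what lets a vectorized implementation load several of each at once; a tree layout then limits how many of those chunks need to be visited.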

How to Use Noodles Library in Rust

Sat, 04 Mar 2023 00:00:00 +0000

1. Introduction

Noodles and Rust-htslib are two widely used Rust libraries for genomic data handling. Although both are designed to work with genomic data, they take different approaches to that goal. This post explores Noodles, compares it with Rust-htslib, and discusses its potential pitfalls.

Bioinformatics Algorithm Library aka BINARY

Mon, 26 Sep 2022 00:00:00 +0000

The library is a collection of algorithms and data structures that are designed for modern C++ bioinformatics applications. You can use the library in your own projects or as a part of a larger project.

Bioinformatics Toolbox Aka Boss

Sun, 25 Sep 2022 00:00:00 +0000

BOSS is a bioinformatics toolbox that will contain efficient tools. It is written in modern C++ and is tested exhaustively. It is designed to be easy to use and time-efficient. BOSS is free software and is distributed under the terms of the GNU General Public License v3.

C++ Development in Bioinformatics

Wed, 15 Jun 2022 00:00:00 +0000

1.1 Config Compile Environment

I am planning to develop a tool in C++ on both Linux and macOS. A recurring obstacle is that I lack root access, so I cannot install dependencies directly with apt-get install -y dependencies on Ubuntu. Working through a complicated dependency chain and compiling each library by hand is time-consuming and can take anywhere from a night to a week. One solution is a package manager such as Conda, which is used mainly in data science but also supports other languages such as C++, Rust, and R. Package names can change at any time, so it is necessary to search for the current name of each package. Even so, Conda can be a useful tool for installing C++ dependencies, particularly in bioinformatics. Other options for managing C++ dependencies include Vcpkg and Conan; I use CPM as an alternative.

Nei Saitou Neighbor Joining

Wed, 03 Apr 2019 00:00:00 +0000

1. Background

Before diving into code, a description of the NJ algorithm can be found in

, where the first column indicates the parent node, the second column its child node, and the last column the edge value.
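As a point of reference, the following is a minimal sketch of the standard neighbor-joining procedure that emits edges in the same three-column layout (parent, child, edge value). It is not the post's own code: the function name, the `node0`, `node1`, ... labels for internal nodes, and the five-taxon distance matrix are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (parent, child, edge length)


def neighbor_joining(names: List[str], dist: List[List[float]]) -> List[Edge]:
    """Build an unrooted tree from a symmetric distance matrix with neighbor joining."""
    # Dict-of-dicts copy so that nodes can be added and removed by name.
    d: Dict[str, Dict[str, float]] = {
        a: {b: float(dist[i][j]) for j, b in enumerate(names)}
        for i, a in enumerate(names)
    }
    edges: List[Edge] = []
    next_id = 0
    while len(d) > 2:
        n = len(d)
        row_sum = {a: sum(d[a].values()) for a in d}
        # Join the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - row_sum[p[0]] - row_sum[p[1]],
        )
        parent = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        # Branch lengths from the new internal node to each joined child.
        len_i = d[i][j] / 2 + (row_sum[i] - row_sum[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges.append((parent, i, len_i))
        edges.append((parent, j, len_j))
        # Distances from the new node to every remaining node.
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2 for k in d if k not in (i, j)}
        # Remove the joined pair, then add the new node.
        del d[i]
        del d[j]
        for row in d.values():
            del row[i]
            del row[j]
        for k, value in new_row.items():
            d[k][parent] = value
        new_row[parent] = 0.0
        d[parent] = new_row
    # Two nodes remain: connect them directly. Which one is listed as the
    # "parent" is arbitrary here because the tree is unrooted.
    a, b = d
    edges.append((a, b, d[a][b]))
    return edges


# Small additive example with five taxa.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
edges = neighbor_joining(taxa, matrix)
for parent, child, length in edges:
    print(parent, child, length)
```

Each printed row follows the parent, child, edge value layout. For an additive matrix like this one, the path lengths in the resulting tree reproduce the input distances.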