Sep 18, 2014
by
Anna Varizhuk; Dmitry Ischenko; Igor Smirnov; Olga Tatarinova; Vyacheslav Severov; Roman Novikov; Vladimir Tsvetkov; Vladimir Naumov; Dmitry Kaluzhny; Galina Pozmogova
A growing body of data suggests that the secondary structures adopted by G-rich polynucleotides may be more diverse than previously thought and that the definition of G-quadruplex-forming sequences should be broadened. We studied solution structures of a series of naturally occurring and model single-stranded DNA fragments defying the G3+NL1G3+NL2G3+NL3G3+ formula, which is used in most of the current GQ-search algorithms. The results confirm the GQ-forming potential of such sequences and...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/23/001990
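The G3+NL1G3+NL2G3+NL3G3+ formula named in the abstract above can be read as a regular expression: four runs of three or more guanines separated by three loops. The following is a minimal sketch, not the authors' method. The loop length range of 1 to 7 nucleotides is an assumption (the truncated abstract does not state one), and `GQ_PATTERN` and `find_canonical_gq` are hypothetical names.

```python
import re

# Canonical motif: four G-runs (three or more Gs each) separated by three loops.
# ASSUMPTION: loops are 1-7 nucleotides long; the abstract gives no loop bounds.
GQ_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_canonical_gq(sequence):
    """Return (start, end, motif) for each non-overlapping canonical match."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in GQ_PATTERN.finditer(seq)]
```

A sequence whose G-runs are shorter than three, such as `GGAGGAGGAGG`, returns no matches under this pattern. Sequences of that kind, which fall outside the formula, are the ones the abstract says it studies.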
Sep 18, 2014
by
Paula Tataru; Jasmine A. Nirody; Yun S. Song
Summary: We present a tool, diCal-IBD, for detecting identity-by-descent (IBD) tracts between pairs of genomic sequences. Our method builds on a recent demographic inference method based on the coalescent with recombination, and is able to incorporate demographic information as a prior. A simulation study shows that diCal-IBD has significantly higher recall and precision than existing SNP-based IBD detection methods, while retaining reasonable accuracy for IBD tracts as small as 0.1 cM....
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/03/005082
Sep 18, 2014
by
John G Cleary; Ross Braithwaite; Kurt Gaastra; Brian S Hilbush; Stuart Inglis; Sean A Irvine; Alan Jackson; Richard Littin; Sahar Nohzadeh-Malakshah; Minita Shah; Mehul Rathod; David Ware; Len Trigg; Francisco M De La Vega
The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyses data from all members of a pedigree...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/24/001958
Sep 18, 2014
by
Mario Fruzangohar; Esmaeil Ebrahimie; David L Adelson
Gene Ontology (GO) classification of statistically significant over/under expressed genes is commonly used to interpret transcriptomics data in functional genomic analysis. In this approach, all significant genes contribute equally to the final GO classification regardless of their actual expression levels. However, the original level of gene expression can significantly affect protein production and consequently GO term enrichment, and genes with low expression levels can participate in the...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/05/14/004911
Sep 18, 2014
by
Shoshana Marcus; Hayan Lee; Michael Schatz
Motivation: With the rise of improved sequencing technologies, genomics is expanding from a single reference per species paradigm into a more comprehensive pan-genome approach with multiple individuals represented and analyzed together. One of the most sophisticated data structures for representing an entire population of genomes is a compressed de Bruijn graph. The graph structure can robustly represent simple SNPs to complex structural variations far beyond what can be done from linear...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/04/06/003954
Sep 18, 2014
by
Todd J. Treangen; Brian D. Ondov; Sergey Koren; Adam M. Phillippy
Though many microbial species or clades now have hundreds of sequenced genomes, existing whole-genome alignment methods do not efficiently handle comparisons on this scale. Here we present the Harvest suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Combined they provide interactive core-genome alignments, variant calls,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/07/22/007351
Sep 18, 2014
by
Matti Pirinen; Tuuli Lappalainen; Noah A Zaitlen; GTEx Consortium; Emmanouil T Dermitzakis; Peter Donnelly; Mark I McCarthy; Manuel A Rivas
Motivation: RNA sequencing enables allele specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression project (GTEx) is collecting RNA-seq data on multiple tissues of the same set of individuals, and novel methods are required for the analysis of these data. Results: We present a statistical method to compare different patterns of ASE across...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/07/17/007211
Sep 18, 2014
by
David Lovell; Vera Pawlowsky-Glahn; Juan José Egozcue; Samuel Marguerat; Jürg Bähler
In the life sciences, many measurement methods yield only the relative abundances of different components in a sample. With such relative (or compositional) data, differential expression needs careful interpretation, and correlation, a statistical workhorse for analyzing pairwise relationships, is an inappropriate measure of association. Using yeast gene expression data we show how correlation can be misleading and present proportionality as a valid alternative for relative data. We...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/08/25/008417
Sep 18, 2014
by
Adrin Jalali; Nico Pfeifer
Motivation: Molecular measurements from cancer patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples, might not be completely fulfilled. We introduce a method that can estimate this bias on a per-feature level and incorporate calculated feature confidences into a weighted combination of...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/15/008185
Sep 18, 2014
by
Daniele Ramazzotti; Giulio Caravagna; Loes Olde Loohuis; Alex Graudenzi; Ilya Korsunsky; Giancarlo Mauri; Marco Antoniotti; Bud Mishra
A tumor is thought to result from successive accumulation of genetic alterations, with each resulting population manifesting itself through a novel 'cancer phenotype.' In each such population, clones of higher fitness, contributing to the selection of the cancer phenotype, enjoy a Darwinian selective advantage, thus inexorably driving tumor progression from abnormal growth and oncogenesis through primary tumors to metastasis. Evading glutamine deregulation, anoxia/hypoxia, senescence,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/08/19/008110
Sep 18, 2014
by
Sune Pletscher-Frankild; Albert Pallejà; Kalliopi Tsafou; Janos X Binder; Lars Juhl Jensen
Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/08/25/008425
Sep 18, 2014
by
Bud (Bhubaneswar) Mishra
Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which build on next generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads and interprets the raw reads in terms of transcripts that are grouped with respect to...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/11/15/000489
Sep 18, 2014
by
Hong-Dong Li; Qing-Song Xu; Yi-Zeng Liang
Identifying a small subset of discriminative genes is important for predicting clinical outcomes and facilitating disease diagnosis. Based on the model population analysis framework, we present a method, called PHADIA, which outputs a phase diagram displaying the predictive ability of each variable, providing an intuitive way of selecting informative variables. Using two publicly available microarray datasets, it is demonstrated that our method can select a few informative genes...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/02/05/002360
Sep 18, 2014
by
Nuno A Fonseca; John A Marioni; Alvis Brazma
Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/07/15/005207
Sep 18, 2014
by
Carl Kingsford; Rob Patro
Storing, transmitting, and archiving the amount of data produced by next generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty associated with managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/24/006551
Sep 18, 2014
by
Giuseppe Narzisi; Jason A O'Rawe; Ivan Iossifov; Han Fang; Yoon-ha Lee; Zihua Wang; Yiyang Wu; Gholson J Lyon; Michael Wigler; Michael C Schatz
We present a new open-source algorithm, Scalpel, for sensitive and specific discovery of INDELs in exome-capture data. By combining the power of mapping and assembly, Scalpel carefully searches the de Bruijn graph for sequence paths that span each exon. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for INDEL discovery. We extensively compared Scalpel with a battery of >10000 simulated and >1000...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/18/001370
Sep 18, 2014
by
R Daniel Kortschak; David L Adelson
bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built-in support for concurrent processing, and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/05/12/005033
Sep 18, 2014
by
Qiyun Zhu; Michael Kosoy; Katharina Dittmar
A new computational method of rapid, exhaustive and genome-wide detection of HGT was developed, featuring the systematic analysis of BLAST hit distribution patterns in the context of a priori defined hierarchical evolutionary categories. Genes that fall beyond a series of statistically determined thresholds are identified as not adhering to the typical vertical history of the organisms in question, but instead having a putative horizontal origin. Tests on simulated genomic data suggest that...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/04/02/003731
Sep 18, 2014
by
Jelena Aleksic; Sarah H Carl; Michaela Frye
Background: Next generation sequencing (NGS) is a widely used technology in both basic research and clinical settings and it will continue to have a major impact on biomedical sciences. However, the use of incorrect normalization methods can lead to systematic biases and spurious results, making the selection of an appropriate normalization strategy a crucial and often overlooked part of NGS analysis. Results: We present a basic introduction to the currently available normalization methods for...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/19/006403
Sep 18, 2014
by
Andrea Gobbi; Giuseppe Jurman
Gene coexpression networks inferred by correlation from high throughput profiling such as microarray data represent a simple but effective technique for discovering and interpreting linear gene relationships. In recent years several approaches have been proposed to tackle the problem of deciding when the resulting correlation values are statistically significant. This is especially crucial when the number of samples is small, yielding a non-negligible chance that even high correlation values are...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/12/03/001065
Sep 18, 2014
by
Sarah A Gagliano; Michael R Barnes; Michael E Weale; Jo Knight
The increasing quantity and quality of functional genomic information motivate the assessment and integration of these data with association data, including data originating from genome-wide association studies (GWAS). We used previously described GWAS signals (hits) to train a regularized logistic model in order to predict SNP causality on the basis of a large multivariate functional dataset. We show how this model can be used to derive Bayes factors for integrating functional and association...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/12/04/000984
Sep 18, 2014
by
Yuande Tan
Next generation sequencing is being increasingly used for transcriptome-wide analysis of differential gene expression. The primary goal in profiling expression is to identify genes or RNA isoforms differentially expressed between specific conditions. Yet next generation sequencing count data are essentially different from microarray data, which are of continuous type; therefore, the statistical methods developed over the last decades are not applicable. For this reason, a variety...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/26/002097
Sep 18, 2014
by
Enrique Carrillo de Santa Pau; Juliane Perner; David Juan; Simone Marsili; David Ochoa; Ho-Ryun Chung; Daniel Rico; Martin Vingron; Alfonso Valencia
We have analyzed publicly available epigenomic data of mouse embryonic stem cells (ESCs) combining diverse next-generation sequencing (NGS) studies (139 experiments from 30 datasets with a total of 77 epigenomic features) into a homogeneous dataset comprising various cytosine modifications (5mC, 5hmC and 5fC), histone marks and Chromatin related Proteins (CrPs). We applied a set of newly developed statistical analysis methods with the goal of understanding the associations between chromatin...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/06/008821
Sep 18, 2014
by
Pratha Sah; Lisa O. Singh; Aaron Clauset; Shweta Bansal
A modular pattern, also called community structure, is ubiquitous in biological networks. There has been an increased interest in unraveling the community structure of biological systems as it may provide important insights into a system's functional components and the impact of local structures on dynamics at a global scale. Choosing an appropriate community detection algorithm to identify the community structure in an empirical network can be difficult, however, as the many algorithms...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/02/001545
Sep 18, 2014
by
Paolo Ferragina; Bud (Bhubaneswar) Mishra
This paper reports an initial design of new data structures that generalize the idea of pattern-matching in stringology, from its traditional usage in an (unstructured) set of strings to the arena of a well-structured family of strings. In particular, the object of interest is a family of strings composed of blocks/classes of highly similar stringlets, which thus mimics a population of genomes made by concatenating haplotype-blocks, further constrained by haplotype-phasing. Such a family of...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/02/001669
Sep 18, 2014
by
Anatoly Yambartsev; Michael Perlin; Yevgeniy Kovchegov; Natalia Shulzhenko; Karina Mine; Andrey Morgun
Gene regulatory networks are commonly used for modeling biological processes and revealing underlying molecular mechanisms. The reconstruction of gene regulatory networks from observational data is a challenging task, especially considering the large number of involved players (e.g. genes) and the much smaller number of biological replicates available for analysis. Herein, we propose a new statistical method of estimating the number of erroneous edges that strongly enhances the commonly used inference...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/11/15/000497
Sep 18, 2014
by
Herbert M Sauro; Totte T Karlsson; Maciej Swat; Michal Galdzicki; Andy Somogyi
We describe libRoadRunner, a cross-platform, open-source, high performance C++ library for running and analyzing SBML-compliant models. libRoadRunner was created primarily to achieve high performance, ease of use, portability and an extensible architecture. libRoadRunner includes a comprehensive API, Plugin support, Python scripting and additional functionality such as stoichiometric and metabolic control analysis. To maximize collaboration, we made libRoadRunner open source and released it...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/12/12/001230
Sep 18, 2014
by
Matthew D MacManes
The widespread and rapid adoption of high-throughput sequencing technologies has changed the face of modern studies of evolutionary genetics. Indeed, newer sequencing technologies, like Illumina sequencing, have afforded researchers the opportunity to gain a deep understanding of genome level processes that underlie evolutionary change. In particular, researchers interested in functional biology and adaptation have used these technologies to sequence mRNA transcriptomes of specific tissues,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/14/000422-0
Sep 18, 2014
by
Ludwig Krippahl; Fábio Madeira
Background: Constraint programming (CP) is usually seen as a rigid approach, focusing on crisp, precise, distinctions between what is allowed as a solution and what is not. At first sight, this makes it seem inadequate for bioinformatics applications that rely mostly on statistical parameters and optimisation. The prediction of protein interactions, or protein docking, is one such application. And this apparent problem with CP is particularly evident when constraints are provided by noisy data,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/02/03/002329
Sep 18, 2014
by
Konrad Ulrich Förstner; Jörg Vogel; Cynthia Mira Sharma
Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/05/19/003723
Sep 18, 2014
by
Sebastian Gil Anthony Konietzny; Phillip Byron Pope; Aaron Weimann; Alice Carolyn McHardy
Background: Efficient industrial processes for converting plant lignocellulosic materials into biofuels are a key challenge in global efforts to use alternative energy sources to fossil fuels. Novel cellulolytic enzymes have been discovered from microbial genomes and metagenomes of microbial communities. However, the identification of relevant genes without known homologs, and elucidation of the lignocellulolytic pathways and protein complexes for different microorganisms remain a challenge....
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/05/21/005355
Sep 18, 2014
by
Hayan Lee; James Gurtowski; Shinjae Yoo; Shoshana Marcus; W. Richard McCombie; Michael Schatz
Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/18/006395
Sep 18, 2014
by
Ashish Bhan; Animesh Ray
Can one hear the sound of a growing network? We address the problem of recognizing the topology of evolving biological or social networks. Starting from percolation theory, we analytically prove a linear inverse relationship between two simple graph parameters (the logarithm of the average cluster size and the logarithm of the ratio of the edges of the graph to the theoretically maximum number of edges for that graph) that holds for all growing power law graphs. The result establishes a novel property...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/04/09/004028
Sep 18, 2014
by
Bo Li; Nathanael Fillmore; Yongsheng Bai; Mike Collins; James A Thomson; Ron Stewart; Colin Dewey
RNA-Seq assembly facilitates the study of transcriptomes for species without sequenced genomes, but it is challenging to select the most accurate assembly in this context. To address this challenge, we developed a model-based score, RSEM-EVAL, for evaluating assemblies when the ground truth is unknown. Our experiments show that RSEM-EVAL correctly reflects assembly accuracy, as measured by REF-EVAL, a refined set of ground-truth-based scores that we also developed. With the guidance of...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/13/006338
Sep 18, 2014
by
Andrew S. Warren; Cristina Aurrecoechea; Brian Brunk; Prerak Desai; Scott Emrich; Gloria I. Giraldo-Calderón; Omar Harb; Deborah Hix; Daniel Lawson; Dustin Machi; Chunhong Mao; Michael McClelland; Eric Nordberg; Maulik Shukla; Leslie B. Vosshall; Alice R. Wattam; Rebecca Will; Hyun Seung Yoo; Bruno Sobral
Motivation: RNA-Seq is a method for profiling transcription using high-throughput sequencing and is an important component of many research projects that wish to study transcript isoforms, condition specific expression, and transcriptional structure. The methods, tools, and technologies employed to perform RNA-Seq analysis continue to change, creating a bioinformatics challenge for researchers who wish to exploit these data. Resources that bring together genomic data, analysis tools,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/08/14/007963.1
Sep 1, 2011
by
Claverie, Jean-Michel; Notredame, Cedric
Includes index
Topic: Bioinformatics
Sep 18, 2014
by
Michael Kuhn; Andreas Beyer
Gene expression programs have been found to be highly conserved between closely related species, especially when comparing the same tissue types between species. Such analysis is, however, much more challenging over larger evolutionary distances when complementary tissues cannot readily be defined. Here, we present the first cross-species mapping of tissue-specific and developmental gene expression patterns across a wide range of animals, including many non-model species. Importantly, our...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/02/007252
Sep 18, 2014
by
Yongsheng Li; Yunpeng Zhang; Shengli Li; Jianping Lu; Juan Chen; Zheng Zhao; Jing Bai; Juan Xu; Xia Li
The development of human breast cancer is driven by changes in the genetic and epigenetic landscape of the cell. Despite growing appreciation of the importance of epigenetics in breast cancers, our knowledge of epigenetic alterations of non-coding RNAs (ncRNAs) in breast cancers remains limited. Here, we explored the epigenetic patterns of ncRNAs in breast cancers via a sequencing-based comparative methylome analysis, mainly focusing on the two most popular ncRNA biotypes, long non-coding RNAs...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/28/002204
Sep 18, 2014
by
Guoli Sun; Alexander Krasnitz
Background: One of the most common goals of hierarchical clustering is finding those branches of a tree that form quantifiably distinct data subtypes. Achieving this goal in a statistically meaningful way requires (a) a measure of distinctness of a branch and (b) a test to determine the significance of the observed measure, applicable to all branches and across multiple scales of dissimilarity. Results: We formulate a method termed Tree Branches Evaluated Statistically for Tightness (TBEST) for...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/05/002188
Sep 18, 2014
by
Yasin Şenbabaoğlu; George Michailidis; Jun Z Li
Consensus clustering (CC) is an unsupervised class discovery method widely used to study sample heterogeneity in high-dimensional datasets. It calculates the "consensus rate" between any two samples as the frequency with which they are grouped together in repeated clustering runs under a certain degree of random perturbation. The pairwise consensus rates form a between-sample similarity matrix, which has been used (1) as a visual proof that clusters exist, (2) for comparing stability among...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/03/11/002642
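The pairwise consensus rate described in the abstract above can be illustrated with a short sketch. It assumes the repeated clustering runs have already been performed and that each run covers only a random subsample. Dividing by the number of runs in which both samples were drawn is an assumed convention, since the abstract does not specify the denominator. The function name `consensus_matrix` and the input layout are hypothetical.

```python
from itertools import combinations


def consensus_matrix(runs, n_samples):
    """Build the between-sample consensus matrix.

    runs: list of dicts, one per clustering run, mapping sample index to
          cluster label for the samples drawn in that run (ASSUMED layout).
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            # Count runs in which both samples were drawn.
            sampled[i][j] += 1
            sampled[j][i] += 1
            # Count runs in which they also shared a cluster.
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    consensus = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        consensus[i][i] = 1.0
        for j in range(n_samples):
            if i != j and sampled[i][j]:
                consensus[i][j] = together[i][j] / sampled[i][j]
    return consensus
```

Pairs that were never drawn together keep a rate of 0.0 in this sketch. A real analysis might prefer to mark them as missing.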
Sep 18, 2014
by
Narjes S. Movahedi; Zeinab Taghavi; Mallory Embree; Harish Nagarajan; Karsten Zengler; Hamidreza Chitsaz
As the vast majority of all microbes are unculturable, single-cell sequencing has become a significant method to gain insight into microbial physiology. Single-cell sequencing methods, currently powered by multiple displacement genome amplification (MDA), have passed important milestones such as finishing and closing the genome of a prokaryote. However, the quality and reliability of genome assemblies from single cells are still unsatisfactory due to uneven coverage depth and the absence of...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/02/24/002972
Sep 18, 2014
by
Sikander Hayat; Chris Sander; Arne Elofsson; Debora S. Marks
Transmembrane β-barrels are known to play major roles in substrate transport and protein biogenesis in gram-negative bacteria, chloroplasts and mitochondria. However, the exact number of transmembrane β-barrel families is unknown and experimental structure determination is challenging. In theory, if one knows the number of strands in the β-barrel, then the 3D structure of the barrel could be trivial, but current topology predictions do not predict accurate structures and are unable to give...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/25/006577
Sep 18, 2014
by
Vladimir Skornyakov; Maria Skornyakova; Antonina Shurygina; Pavel Skornyakov
In this study Markov chain models of gene regulatory networks (GRN) are developed. These models give the ability to apply the well-known theory and tools of Markov chains to GRN analysis. We introduce a new kind of finite interaction graph, called the combinatorial net, that formally represents a GRN, and the transition graphs constructed from interaction graphs. System dynamics are defined as a random walk on the transition graph, which is a Markov chain. A novel concurrent...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/23/006361
Sep 18, 2014
by
Jeffrey Leek
It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/06/26/006585
Sep 18, 2014
by
Feng Zeng; Rui Jiang; Guoli Ji; Ting Chen
Incorrect alignments are a severe problem in variant calling and remain a challenging computational issue in bioinformatics. Although some methods use re-alignment to tackle misalignments, a standalone re-alignment tool for long sequencing reads is lacking. Hence, we present a standalone tool to correct misalignments, called ProbAlign. It can be integrated into the pipelines of not only variant calling but also other genomic applications....
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/02/008698
Sep 18, 2014
by
Chunlei Wu; Adam Mark; Andrew I. Su
Biomedical knowledge is often represented as annotations of biological entities such as genes, genetic variants, diseases, and drugs. Gene annotations are fragmented across data repositories like NCBI Entrez, Ensembl, UniProt, and hundreds (or more) of other specialized databases. While the volume and breadth of annotations are valuable, their fragmentation across many data silos is often frustrating and inefficient. Bioinformaticians everywhere must continuously and repetitively...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/17/009332
Sep 18, 2014
by
Sterling Sawaya; James Boocock; Mik Black; Neil Gemmell
Pausing of DNA polymerase can indicate the presence of a DNA structure that differs from the canonical double-helix. Here we detail a method to investigate how polymerase pausing in the Pacific Biosciences sequencer reads can be related to DNA structure. The Pacific Biosciences sequencer uses optics to view a polymerase and its interaction with a single DNA molecule in real-time, offering a unique way to detect potential alternative DNA structures. We have developed a new way to examine...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2013/12/02/001024
Sep 18, 2014
by
Samuel Minot; Stephen D Turner; Krista L Ternus; Dana R Kadavy
Next-generation sequencing is increasingly being used to study samples composed of mixtures of organisms, such as in clinical applications where the presence of a pathogen at very low abundance may be highly important. We present an analytical method (SIANN: Strain Identification by Alignment to Near Neighbors) specifically designed to rapidly detect a set of target organisms in mixed samples that achieves a high degree of species- and strain-specificity by aligning short sequence reads to the...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/01/10/001727
Sep 18, 2014
by
Anna-Sophie Fiston-Lavier; Maite G. Barrón; Dmitri A. Petrov; Josefa González
Transposable elements (TEs) constitute the most active, diverse and ancient component in a broad range of genomes. Complete understanding of genome function and evolution cannot be achieved without a thorough understanding of TE impact and biology. However, in-depth analysis of TEs still represents a challenge due to the repetitive nature of these genomic entities. In this work, we present a broadly applicable and flexible tool: T-lex2. T-lex2 is the only available software that allows routine,...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/09/16/002964
Sep 18, 2014
by
Sergey Koren; Todd J Treangen; Christopher M Hill; Mihai Pop; Adam M Phillippy
Background: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate...
Topic: Bioinformatics
Source: http://biorxiv.org/content/early/2014/02/07/002469