R or Python for bioinformatics

“Which programming language is better, though?” This is the first question that plagues anyone who wants to learn bioinformatics, and it’s a fair question to ask when you’re new to the field!

Bioinformaticians use multiple languages and pipelines to achieve similar objectives with their data, so wanting to know which is best is natural. In the genomics space, the two languages that have traditionally dominated the manipulation of processed data are R and Python, and they are the ones we’ll focus on below.

Comparing R vs Python

Put simply, neither language is totally dominant over the other, and each serves different purposes in bioinformatics. You'll need to weigh the strengths and weaknesses of each language to figure out which one will help you with your task. So let’s compare them head-to-head, as summarized in this chart!

[Chart: R vs Python for bioinformatics]

R vs Python: Category Breakdown

Plotting

Plotting, in my opinion, is the foundation of communicating complex information to your audience. As I was told during my graduate school training:

“A good figure is one that doesn’t need a legend to explain the data being presented.”

That statement couldn’t be more accurate, as you’ll see when you look through any great publication or presentation.

As such, being able to plot your data in multiple, easy-to-interpret ways is an absolute requirement of the language you choose. Thankfully, both R and Python offer plenty of plotting functionality for publication-ready figures. However, in my experience R has offered far more plotting packages that are built for, or widely used with, genomics data, the most widely used being ggplot2. R also lets you plot in the lattice style as an alternative to the ggplot2 style, giving you multiple ways to generate plots and layer information. While Python does have packages for the standard plots we see, in my experience the more advanced and complex plots are rarely made with Python.

Advantage: R

Large data manipulation

Genomic data has always been comically large, from single BAM files that reach and exceed 100 GB each to processed data with so many dimensions that Excel is unable to ingest it. The ability to consume and process large datasets has therefore become a requirement for programming languages that handle genomic data.

While both R and Python are able to consume and process a lot of data, the advantage has to be given to Python. R can consume large amounts of information, but with the advent of single-cell processing, R packages have fallen short of their Python counterparts in keeping RAM consumption low. While part of the responsibility lies with package code that is not optimized for memory consumption, Python has also been leading the charge in large data manipulation in other fields, and it has a clear and distinct advantage in this regard. As our need to process larger data grows, Python’s strength in this department will continue to shine.
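To make "keeping RAM consumption low" a little more concrete, here is a minimal sketch of one common pattern: streaming a count table row by row instead of loading it all at once. It uses only Python's standard library. The function name `stream_gene_means` and the toy data are made up for illustration, and real single-cell tools rely on their own file formats and optimizations, so treat this as a picture of the idea rather than how any particular package works.

```python
import csv
import io


def stream_gene_means(handle):
    """Yield (gene, mean_count) one row at a time from a CSV file handle.

    Only the current row is held in memory, so peak RAM use does not grow
    with the number of genes in the file.
    """
    reader = csv.reader(handle)
    next(reader)  # skip the header row: gene, sample_1, sample_2, ...
    for row in reader:
        gene = row[0]
        counts = [float(value) for value in row[1:]]
        yield gene, sum(counts) / len(counts)


# Hypothetical toy count matrix standing in for a much larger file on disk.
toy_counts = io.StringIO(
    "gene,sample_1,sample_2,sample_3\n"
    "GeneA,10,20,30\n"
    "GeneB,0,5,10\n"
)

gene_means = dict(stream_gene_means(toy_counts))
print(gene_means)
```

With a real file you would pass `open("counts.csv", newline="")` (a hypothetical path) instead of the in-memory `StringIO` object, and the function would still read only one row at a time.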

Advantage: Python

Pre-existing packages

Continuing from the last section, genomics data processing involves installing and using tools and packages written by other experts. Historically, these packages have largely been written and used in R, as that is where these labs’ expertise lies. Well-known, heavily used packages like DESeq2, Seurat, and ggplot2 appear in numerous high-impact publications.

While it is possible to use these R packages from Python (through rpy2), this approach can often cause problems for beginners. Newer packages are being written for both R and Python, so Python’s ability to process different datasets is increasing. However, at this point in time, R simply has more package support.

Advantage: R

Interactive scripts

Oftentimes you’ll want to share your code in a clean way that other people can use and follow, reading the comments around each block of code and viewing the generated plots. Thankfully, both languages have developed great solutions to this problem.

Python offers Jupyter notebooks, which can be shared and run one cell at a time, while R lets users build and distribute R Markdown files that can be run in a similar manner, one code chunk at a time. Both provide excellent frameworks for distributing code in a presentable way.

Advantage: R and Python

Converting to API Endpoints

While this point is on the more advanced side, it is worth knowing about as your coding abilities grow. It covers wrapping your functions and code as an API endpoint, which simply means that users can send in the variables your function or block of code expects and get the result back. The advantage is that multiple users can use and check your functions without having to delve deeply into the code. It also allows the APIs to be accessed publicly.

Python offers a variety of ways to stand up services that expose these API endpoints, one of which is Flask. Unfortunately, for all the strengths that R has, the only way I have seen to implement an API in R is the plumber package. Alternatively, R scripts can be turned into a Shiny app. However, I personally have not seen R APIs used widely, and so the advantage, from my perspective, goes to Python.
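To show what "wrapping a function as an API endpoint" can look like, here is a minimal sketch. It deliberately uses Python's built-in `http.server` module rather than Flask so that it runs without installing anything; a Flask version would follow the same idea with less boilerplate. The function `gc_content`, the `/gc` path, and the `seq` parameter are all hypothetical names chosen for this example.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class GCHandler(BaseHTTPRequestHandler):
    """Expose gc_content at GET /gc?seq=<sequence> and answer with JSON."""

    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/gc":
            self.send_error(404, "Unknown endpoint")
            return
        sequence = parse_qs(parsed.query).get("seq", [""])[0]
        body = json.dumps({"seq": sequence, "gc": gc_content(sequence)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging to keep the example quiet


def call_endpoint(sequence):
    """Start the service on a free local port, send one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), GCHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        connection = HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/gc?" + urlencode({"seq": sequence}))
        payload = json.loads(connection.getresponse().read())
        connection.close()
        return payload
    finally:
        server.shutdown()
        server.server_close()


result = call_endpoint("ATGCGC")
print(result)
```

The caller only needs to know the URL and the `seq` parameter, not how `gc_content` is written, which is the point made above about users not having to delve into the code.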

Advantage: Python

So…which programming language should I use?

Looking over the various points we’ve discussed, the main thing to keep in mind is that neither language is better than the other. Both programming languages have their strengths and weaknesses, and the key is to research the specific goal you want to accomplish, look at the features and packages offered by both languages, and then choose the one you want to use!

Oftentimes you’ll find that you could have used either language and would simply have had to overcome different problems along the way, with the key takeaway being the new skills you gained. Hopefully this list has helped you focus on how to start tackling and achieving your first goal!


Should I learn R or Python for bioinformatics?

In the context of biomedical data science, learn Python first, then learn enough R to get your analysis done. The exception is if the lab you're in is R-dependent, in which case learn R and fill in the gaps with enough Python for easier scripting. If you learn both, you can call R code from Python using rpy2.

Is R good for bioinformatics?

R is one of the most widely used and powerful programming languages in bioinformatics. R especially shines where a variety of statistical tools are required (e.g. RNA-Seq, population genomics) and in the generation of publication-quality graphs and figures.

Is Python enough for bioinformatics?

In bioinformatics, Python is used extensively for data analysis and tool development. Python is also a general-purpose, object-oriented programming language. It can serve as the primary language for implementing complete packages and applications, with the great advantage of platform independence.

Which language is best for bioinformatics?

Perl has long been a go-to programming language in bioinformatics. Though it is considered obsolete in several other fields, it is still widely used in bioinformatics and computational biology today.