PROGRAMMING
An example for data scientists who believe they don’t need to know C++
Published Dec 16, 2020 · 4 min read
There are millions of reasons to love Python, especially if you are a data scientist. But how does Python differ from lower-level, compiled languages such as C or C++? I suspect many data scientists and Python users have asked themselves this question, or will one day. The languages differ in many ways. In this article, I focus on just one of them, execution speed, and show how fast C++ is compared to Python using a very simple example.
To show the difference, I decided to go with a simple, practical task rather than a contrived one. The task is to generate all possible DNA k-mers for a fixed value of “k”. If you don’t know what DNA k-mers are, I explain them in plain language in the next section. I chose this example because many genomic data processing and analysis tasks (e.g. k-mer generation) are computationally intensive. That is one reason why many data scientists in bioinformatics are interested in C++ in addition to Python.
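To make the task concrete before going further, here is a minimal Python sketch of what "generate all possible DNA k-mers" means. It is not the code benchmarked in this article: it leans on the standard library's `itertools.product` for brevity, and the names `NUCLEOTIDES` and `all_kmers` are made up for this illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # the four DNA letters (hypothetical constant name)


def all_kmers(k):
    """Yield every DNA string of length k, one at a time."""
    # product(..., repeat=k) enumerates all 4**k letter combinations.
    for letters in product(NUCLEOTIDES, repeat=k):
        yield "".join(letters)


kmers = list(all_kmers(2))
print(len(kmers))  # 16, i.e. 4**2
print(kmers[:5])   # ['AA', 'AC', 'AG', 'AT', 'CA']
```

The number of k-mers is 4 to the power of k, so the amount of work grows quickly as "k" increases, which is what makes this a convenient task for a speed comparison.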
IMPORTANT NOTE: The goal of this article is not to compare C++ and Python at their most efficient. Both programs could be written more efficiently, using much better approaches. The only goal is to compare the two languages when they follow exactly the same algorithm and instructions.
DNA is a long chain of units called nucleotides. There are 4 types of nucleotides in DNA, represented by the letters A, C, G, and T. The human (or, more precisely, Homo sapiens) genome has about 3 billion nucleotide pairs. For example, a small portion of human DNA could look something like this: ACTAGGGATCATGAAGATAATGTTGGTGTTTGTATGGTTTTCAGACAATT