Data Science Wars: Python vs. R

As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. Android, etc., but this one seems to have a different flavor to it. What feels different in this case is that the application area is the same, namely performing work in data science where the solution often depends on the use of libraries that implement various machine learning algorithms. This being the case, the question is: which language should you adopt as a data scientist?

While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python. Here is a short list of some of the arguments I’ve heard of late, along with my personal assessment of each:

R is Too Complex

The most frequently stated argument I’ve heard is that Python is general purpose and comparatively easy to learn, whereas R remains a somewhat complex programming environment to master. I think this view is misguided, since complexity is in the eye of the beholder (i.e. the programmer). I will agree that R is certainly a very powerful data analysis and data modeling tool with a specific emphasis on machine learning. Moreover, much of the power of R is in its ecosystem. There are more than 5,000 packages that extend the open source statistical environment to new heights.

When I first learned R, I did not find it particularly complex; it was a lot easier for me to learn R than C++ or Java with their mammoth frameworks. Besides, the application of machine learning is much more “complex” than any programming language used to develop a given algorithm. Using Python won’t circumvent that fact.

R Isn’t Really a Language

Another argument says that part of the reason people struggle to learn R is that it’s not really a language. As R expert John Cook points out, R “is an interactive environment for doing statistics,” not really a programming language. He also suggests, “I find it more helpful to think of R as having a programming language than being a programming language.” This view may be somewhat accurate, but the fact that R doesn’t look like a traditional programming language doesn’t mean it isn’t one. It is simply well-suited for its intended problem domain, namely statistical analysis and machine learning. Once its nuances are mastered, R developers tend to swear by it and use it as a primary tool for data science projects. Plus, R tends to reduce complexity for data scientists because it incorporates vectorized operations, which map naturally onto the linear algebra underlying many machine learning algorithms.

Python is More Approachable

Some feel that Python is more approachable. I’ve heard some say that since all sorts of developers are familiar with Python and use it for a wide array of applications, it is the better choice for data science – unlike R, which is pretty much only used for data analysis. I feel this is a silly argument. Wouldn’t you want to use a tool that is specifically suited to a particular task, rather than one that doesn’t include specific features for the intended problem domain? There’s nothing wrong with using a special-purpose programming language to solve special-purpose problems.

Remember, R is a very old statistical environment that has an incredible global following. The functions in the base stats package, as well as many packages found in CRAN, are in many cases based on very old implementations (some in Fortran) of classic algorithms (e.g. the Random Forest algorithm in R is based on the original Fortran code by Leo Breiman and Adele Cutler). It is good to know that the modeling language I’m using has a long and trusted history.

People in the Organization Already Know Python

I’ve heard some people express the view that as businesses struggle to get more value out of their data assets, they’re also struggling to find qualified data scientists. They say that more often than not such data scientists may already work internally and likely have some familiarity with Python. The feeling is that given the importance of asking the right questions of one’s data, training up homegrown talent on big data technologies is much more effective than training new-hire data scientists on the complexities of one’s business.

While it is a well-founded policy to hire from within an organization to fill certain positions, since the candidate will likely have valuable domain experience, I think it is a huge stretch of the imagination to think a talented data scientist is lurking somewhere in the organization just waiting to pick up the torch without breaking stride. Huh? Is the data scientist just slumming for a while as a UI developer using Python? I don’t think this scenario is very likely. It is much more reasonable to see that data scientist as a new hire. The talent you’re seeking is based on computer science, machine learning, mathematical statistics and probability theory, hardly the skill set of a run-of-the-mill IT staffer.

A Single Language Across Applications

Yet another argument I’ve heard is that beyond tapping into a ready-made Python developer pool, one of the biggest benefits of doing data science in Python is the added efficiency of using one programming language across different applications. I really don’t see using two languages as a problem, because in my world there are both “theorist” data scientists and “experimental” data scientists. R is ideal for the theorist who does exploratory data analysis, data munging, modeling and algorithm building. Python, on the other hand, is a good tool for the experimentalist who implements algorithms for production use. I believe R and Python make excellent partners in building data science solutions. As a consultant I often work with internal developers, many of whom use Python to implement the algorithms I develop in R.

A Path for the Future

Python lacks much of R’s richness for data analysis, data modeling and machine learning, but it is making progress. At this point, data science is a very technical area and in my mind you can’t give up R’s depth in favor of Python’s approachability and general-purpose nature. As I mentioned above, the two languages can and do live together nicely. Data science will always be the realm of “scientists who deal with data,” and that will not change anytime soon considering the overly simple nature of the recent “machine learning as a service” product offerings. Practitioners still maintain a firm foundation in mathematics and statistics, which is beyond mortal business analysts and others.

So at least for now, I’d like to douse the flames of the Python vs. R tech war. I think there’s plenty of room for two good choices in the pursuit of robust data science.

 

 

Comments

  1. Nir Friedman says:

    I think your article is fairly biased, which is fine, but it’s better to be upfront with your biases. I think you somewhat misrepresent the arguments, so let me put in my $0.02:

    1) R taken as a programming language as opposed to a massive collection of libraries, is… well, it’s bad. OO is not a core part of the language, and this is a big issue. There are two different styles of doing classes depending which package you import. Dictionaries are not part of the language standard. Etc. Reality is that S (R’s dad) was designed by statisticians with zero knowledge of software engineering. Matlab was designed by engineers who sort of knew how to program. And Python was designed by a pro. And it shows in exactly how I would rank the core language from a programming perspective.

    2) Following up with above: Python is consistently considered one of the easiest languages to learn. R is considered pretty hard. Not much to argue about here.

    3) You really could not argue against yourself more effectively than by lauding, of all things, R’s vectorization. NumPy has the exact same types of vectorization operations as R. Matlab does as well, with easily the nicest syntax out of the three. Both these languages are usually faster than R. R’s slicing is completely broken compared to these languages: if v is a vector, then no values of i and j will return an empty list for v[i:j]. This is terrible behavior that breaks a case in many situations involving slicing.

    4) There always is some advantage in sticking to one language. When it comes to anything outside statistics, Python is far superior: querying databases, processing text, making system calls, etc. There is a real mental cost in switching from one language to another, not to mention making them play nice with each other.

    The reality is that R has more packages for stats than Python, and Python wins on everything else. Of course, if you really care about doing stats, that may be reason enough to use R.
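
The slicing complaint in item 3) can be sketched in plain Python. This is only an illustration under stated assumptions: `r_colon` is a hypothetical helper, not part of any library, that mimics how R's integer `i:j` operator counts up or down inclusively, which is why an index built with `i:j` is never an empty sequence. Python's half-open slice, by contrast, is empty whenever start equals stop.

```python
# r_colon is a hypothetical helper written for this illustration only.
# It mimics R's integer `i:j` operator, which is inclusive at both ends
# and counts down when i > j.

def r_colon(i, j):
    """Return the inclusive integer sequence from i to j."""
    step = 1 if j >= i else -1
    return list(range(i, j + step, step))


v = [10, 20, 30, 40]

# Python's slice is half-open, so start == stop selects nothing.
empty = v[2:2]

# The R-style sequence always contains at least one index.
ascending = r_colon(2, 4)
descending = r_colon(3, 2)   # counts down instead of being empty
single = r_colon(2, 2)
```

The practical consequence is that loops or slices written with `i:j` in R need an explicit guard for the "nothing to select" case, whereas the Python slice handles it implicitly.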

  2. Regarding Friedman’s comment about the difficulty of learning R, I’ve always found this criticism to be strange. It’s a bit inconsistent, but learning R was about as easy as learning SQL for me. I stayed away from general purpose programming in R, but the statistics aspect is very straightforward.

    I still like Python and think it may be the future. Still needs a lot of work so people aren’t constantly rewriting functions to find something as simple as a decile.
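
On the decile point: Python 3.8 and later include `statistics.quantiles` in the standard library, so a hand-written function may no longer be needed for this particular case. A minimal sketch, assuming a made-up sample of the integers 1 to 100 and the function's default interpolation method:

```python
import statistics

# Made-up sample for illustration: the integers 1 through 100.
scores = list(range(1, 101))

# n=10 requests the nine cut points that split the data into ten groups.
deciles = statistics.quantiles(scores, n=10)

first_decile = deciles[0]
last_decile = deciles[-1]
```

The `method` parameter of `statistics.quantiles` switches between the "exclusive" and "inclusive" conventions, which give slightly different cut points on small samples.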

  3. This is with respect to the learning curve of R. I come from a non-programming background, but with some effort, over a period of 3 months, I became good at R. Therefore, it is not as hard to learn as it seems. (I had a prior background in statistics.)
    However, the problem that I am currently facing is the production-level deployment of an analytic solution at the client. I cannot accomplish it in R because that way I would expose my code to the client and no novelty would be left for me.
    With this perspective I was told that Python is better. I don’t know because I have not worked in Python. Any help here? Do I have to learn Python in order to develop analytic products for the client?

    • @Anurag, thanks for your perspective on this debate. Successfully learning R is a lot about aptitude. You indicate you have a stats background, so that may be why R was easier for you. You say you wish to protect your code for competitive reasons; that is sensible. This means you’ll need a compiled language so you can deploy without distributing the source code. With Python you can generate bytecode, which can lend some protection, but that is not foolproof since there are bytecode decompilers that can recover source. R also has a just-in-time compiler. – Daniel
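
To illustrate why Python bytecode offers only limited protection, here is a minimal sketch using only the standard library. The `score` function and its `secret_weight` constant are hypothetical, invented for this example. The sketch compiles the source, round-trips it through `marshal` (the serialization used inside `.pyc` files), and then reads the variable names and constants straight back out of the code object.

```python
import dis
import marshal

# Hypothetical "proprietary" scoring function, used only for illustration.
source = '''
def score(income, debt):
    secret_weight = 0.37
    return secret_weight * income - debt
'''

# compile() produces the same kind of code object that a .pyc file stores.
module_code = compile(source, "<model>", "exec")

# Round-trip through marshal, as happens when bytecode is written and loaded.
restored = marshal.loads(marshal.dumps(module_code))

# The function body is a nested code object among the module's constants.
func_code = next(c for c in restored.co_consts if hasattr(c, "co_code"))

# Variable names and literal constants survive compilation in readable form.
recovered_names = func_code.co_varnames
recovered_consts = func_code.co_consts

# dis prints a human-readable listing of the bytecode instructions.
dis.dis(func_code)
```

Since the names, constants and instruction sequence are all recoverable this way, shipping only bytecode obscures the logic rather than hiding it.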
