Top Languages and Packages for Machine Learning


Demand for data science skills has been rising, and many would-be data science candidates, as well as companies looking to grow their data science departments, are asking which languages are worth the time to learn. The number of machine learning libraries and packages varies across languages. Programming experience in at least one language is certainly useful, but understanding the pros and cons of the others will help you decide which language to pick up next to make yourself a marketable data science candidate. In late 2016, Fossbytes did a study that analyzed the skills sections of job postings for “machine learning” or “data science” and plotted the trend over time.

It is surprising that demand for Java programmers has surpassed demand for R programmers, and that Scala, a functional programming language quite different from the rest, has also grown heavily over a four-year period. Below, I discuss in depth the machine learning capabilities of five of these languages.

Python

Python arguably has the most expansive libraries and toolkits for data science and machine learning. NumPy, SciPy, Pandas, Matplotlib, and Seaborn remain some of the most popular and powerful libraries for scientific computing, data analysis, and data visualization.

With over 16,000 commits on GitHub to date, NumPy (Numerical Python) is the fundamental package around which the scientific computing stack is built. NumPy facilitates matrix math by using arrays as its fundamental data structure. Where NumPy is lacking, Pandas makes up for it by wrapping the array in a richer structure, the “dataframe”, which lets us work with “labeled” and “relational” data. A dataframe is essentially a table with indexed rows for quick querying. Pandas is great for data wrangling and quick analysis.
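
To make the contrast concrete, here is a minimal sketch of the same kind of data handled first as a plain NumPy array and then as a labeled Pandas dataframe. The column names, row labels, and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# NumPy: the array is the basic structure, and matrix math works on it directly.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])
y = A @ x  # matrix-vector product

# Pandas: tabular data with named columns and a row index.
# The names and numbers below are made up for this example.
df = pd.DataFrame(
    {"height_cm": [170, 182, 165], "weight_kg": [65, 90, 58]},
    index=["ana", "ben", "cho"],
)

tall = df[df["height_cm"] > 168]         # filter rows using a labeled column
ben_weight = df.loc["ben", "weight_kg"]  # look up a value by row and column label
```

The array only knows positions, so you have to remember which column is which. The dataframe carries its labels with it, which is what makes filtering and lookups readable.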

SciPy contains modules for linear algebra, optimization, integration, and statistics. SciPy’s routines operate on NumPy arrays. SciPy also provides the groundwork for scikit-learn, the main machine learning library for Python.
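
As a rough illustration of those four areas, the sketch below calls one routine from each of the corresponding SciPy modules. The toy inputs are assumptions chosen so the answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg, optimize, stats

# Linear algebra: solve A x = b for x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Optimization: find the minimum of a simple quadratic (it sits at t = 2).
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

# Integration: integrate sin(t) from 0 to pi (the exact answer is 2).
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Statistics: summary statistics (count, min/max, mean, variance, ...) of a sample.
summary = stats.describe([2.0, 4.0, 4.0, 5.0, 7.0])
```

Every input and output here is either a plain number or a NumPy array, which is why the two libraries are usually installed and learned together.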

Matplotlib and Seaborn are two heavily used toolkits for visualization. Matplotlib is fairly low-level, so sophisticated charts and plots require more code. Simple visualizations such as line plots, scatter plots, bar charts, histograms, pie charts, and contour plots can be created on the fly. While Matplotlib handles general-purpose plotting, Seaborn, which is built on top of it, focuses mostly on visualizing statistical models.
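
For a sense of what the simple plots look like in code, here is a short sketch that draws a line plot, a scatter plot, and a histogram side by side using Matplotlib only. The data is randomly generated, and the output filename is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(scale=2.0, size=x.size)  # a noisy straight line

fig, (ax_line, ax_scatter, ax_hist) = plt.subplots(1, 3, figsize=(12, 3.5))

ax_line.plot(x, 2.0 * x)
ax_line.set_title("Line plot")

ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter plot")

ax_hist.hist(y - 2.0 * x, bins=10)  # distribution of the added noise
ax_hist.set_title("Histogram of noise")

fig.tight_layout()
fig.savefig("basic_plots.png")  # placeholder output filename
```

Each basic plot is a single call, but titles, layout, and styling all have to be set by hand. That manual work is what higher-level libraries try to reduce.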

Java

There are many benefits to building apps in Java. Java is database- and platform-independent, allowing a multitude of options for data storage and easy switching between platforms. Java is portable, allowing locally built applications to be pushed to the cloud for future use. Java apps are also meant to scale with company growth and an expanding user base. Most importantly, Java is a well-established language with a huge enterprise user base, which allows both present-day and future integration of Java-built apps with different technology stacks.

Java’s machine learning libraries cover a wide range of applications. The Waikato Environment for Knowledge Analysis (WEKA) is a popular machine learning platform with a graphical user interface (GUI). The Konstanz Information Miner (KNIME) and RapidMiner are two other Java-based platforms with useful visualizations.

One place where Java has long had the edge over other languages is in MapReduce implementations for Big Data. Under the hood, Apache Mahout is written in Java. Mahout was one of the first machine learning technologies to take advantage of Apache Hadoop’s power to solve complex problems by breaking them up into parallel tasks. However, with the advent of Apache Spark, parallelization can be performed in many languages, so the need to write many MapReduce jobs in Java has largely disappeared.
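
The idea of breaking a problem into parallel tasks can be sketched in a few lines of plain Python. This is only a toy model of the map, shuffle, and reduce steps, run with local threads. It does not use Hadoop, Mahout, or Spark, and the function names and sample text are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Toy input, split into chunks the way a cluster would split a large file.
chunks = [
    ["spark runs on the jvm", "mahout runs on hadoop"],
    ["hadoop runs map reduce jobs"],
]

# Each chunk is mapped independently, so the map step can run in parallel.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_chunk, chunks))

word_counts = reduce_groups(shuffle(mapped))
```

Because no chunk depends on any other during the map step, the same structure can be spread across many machines. A real framework adds the distribution, scheduling, and fault tolerance.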

R

R is a free, open source statistical programming language with great community support. R is used to build models and has excellent visualization capabilities. Although originally developed for academic research, it is quickly being adopted across various fields in industry. Models and plots written in R can be found everywhere from pharmaceutical companies to financial companies. The R community relies heavily on support from other users and is very active on Stack Overflow. CRAN, a large public repository of curated R packages to which anyone can contribute, is also available. Because R is slightly less popular for machine learning than Python, its developers are valued slightly more in the marketplace. In 2015, the average R programmer had an annual salary of $115k, whereas the average Python programmer made only $94k.

R is great for exploratory work and data analysis and has many available packages. For data visualization, ggvis, lattice, and ggplot2 are available. For machine learning engineers, caret is the most important package. It contains tools for data splitting, pre-processing, model tuning, and variable importance estimation, and it is competitive with Python’s scikit-learn.
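
For readers coming from Python, the same four steps (data splitting, pre-processing, model tuning, and variable importance) look roughly like this in scikit-learn. This is a sketch of the equivalent workflow, not a translation of caret’s API. The dataset, model, and parameter grid are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Data splitting: hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Pre-processing: scale the features, chained with the model in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# 3. Model tuning: cross-validated grid search over a small, arbitrary grid.
search = GridSearchCV(
    pipeline,
    param_grid={"forest__n_estimators": [25, 50], "forest__max_depth": [2, 4]},
    cv=3,
)
search.fit(X_train, y_train)

# 4. Variable importance: read it off the best fitted forest.
importances = search.best_estimator_.named_steps["forest"].feature_importances_
test_accuracy = search.score(X_test, y_test)
```

In caret, most of this is driven through a single training function with resampling options. Scikit-learn spreads the same responsibilities across separate, composable objects.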

C++

For data scientists who want to use machine learning as a tool, C++ is not a good place to begin, since relatively few machine learning libraries exist for it. However, programmers who want to implement machine learning algorithms from scratch have more control and flexibility when coding in a language like C or C++. Although C++ has a much steeper learning curve for beginners than Python, expertise comes much more easily once the basic intricacies of the language are understood.

C++ is also much more efficient than many other languages. Important libraries such as TensorFlow and Torch rely on C or C++ implementations under the hood. C++ libraries such as Shark and mlpack do exist, although they are not up to the same standards as scikit-learn and caret. Not surprisingly, few new data scientists gravitate toward C++. However, comfortable C++ programmers often begin their venture into machine learning by first building algorithms in their native language.

Scala

Scala is well known for its ability to handle large-scale data processing tasks. Because it is a functional programming language, it relies less on memory to execute programs. Moreover, 71% of programmers who do machine learning in Spark do so in Scala.

Spark is an important data science tool that allows machine learning engineers to scale their models. Over 50% of all new Scala programmers are picking up Scala because Spark is written in Scala. Spark is also encouraging traditional functional programmers to start building machine learning models.