_Photo by Federico Beccari on Unsplash_

## The Challenges of Complexity and Underutilization

Organizations typically have multiple environments and frameworks to support their analytic work, with each tool providing specialized capabilities or serving a different audience.

For example, in our most recent R Community Survey, we asked what tools and languages respondents used besides R. The results shown in Figure 1 illustrate the wide variety of tools that may be present in an organization.

Figure 1: Respondents we surveyed use a wide variety of tools in addition to R.

These tools and frameworks provide flexibility and power but can also have two unpleasant, unintended consequences: productivity-undermining complexity for various stakeholders and underutilization of expensive analytic frameworks.

Stakeholders across the organization experience these consequences firsthand.

Figure 2: Interoperability is a key strength of the R ecosystem.

## Teams Need Interoperable Tools

Interoperable systems that give data scientists direct access to different platforms from their native tools can help address these challenges, and everyone in the organization benefits from this approach.

## Encouraging Interoperability

Interoperability is a mindset more than a technology. You can encourage interoperability throughout your data science team with four initiatives:

  1. Embrace open source software. One of the advantages of open source software is the wide community providing specialized packages to connect to data sources, modeling frameworks, and other resources. If you need to connect to something, there is an excellent chance someone in the community has already built a solution. For example, as shown in Figure 2, the R ecosystem already provides interoperability with many different environments.
  2. Make the data natively accessible. Good data science needs access to good, up-to-date data. Direct access to data from the data scientist's preferred tool, instead of requiring specialized software, makes data scientists more productive and makes it easier to automate a data pipeline as part of a data product. Extensive resources exist to help, whether your data is in databases, Spark clusters, or elsewhere.
  3. Provide connections to other data science or ML tools. Every data scientist has a preferred language or tool, and every data science tool has its own unique strengths. By providing easy connections to other tools, you expand the reach of your team and make it easier to collaborate and benefit from the work of others. For example, the reticulate package allows an R user to call Python in a variety of ways, and the tensorflow package provides an interface to large-scale TensorFlow machine learning applications.
  4. Make your compute environments natively accessible. Most data scientists aren’t familiar with job management clusters such as Kubernetes and Slurm and often struggle to use them. By making these environments available directly from their native tools, your data scientists are far more likely to use them. For example, RStudio Server Pro allows a data scientist to run a script on a Kubernetes or Slurm cluster directly from within their familiar IDE.
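Native data access (initiative 2) can be as simple as a DBI connection plus ordinary dplyr verbs. The sketch below uses an in-memory SQLite database with a hypothetical `sales` table for illustration; in practice you would connect to your organization's database (Postgres, Snowflake, etc.) the same way, assuming the DBI, dplyr, dbplyr, and RSQLite packages are installed.

```r
library(DBI)
library(dplyr)

# Hypothetical example data; a real pipeline would connect to an
# existing production database instead of creating one.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales",
             data.frame(month = c(1, 1, 2), amount = c(10, 20, 30)))

# dplyr verbs on tbl(con, ...) are translated to SQL by dbplyr and
# run inside the database; collect() pulls back only the summary.
monthly <- tbl(con, "sales") %>%
  group_by(month) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

Because the heavy lifting happens in the database, the same script scales from a laptop demo to production data with only a change to the `dbConnect()` call.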
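For initiative 3, a minimal sketch of what calling Python from R with reticulate looks like, assuming a Python installation with numpy is available to reticulate:

```r
library(reticulate)

np <- import("numpy")        # import a Python module into R
x <- np$linspace(0, 1, 5)    # call Python functions with R syntax;
x                            # the result comes back as an R vector

py_run_string("msg = 'hello from Python'")
py$msg                       # objects cross the boundary both ways
```

The same pattern lets an R-centric team reuse a colleague's Python model or utility library without leaving their native tooling.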

Eric Nantz, a Research Scientist at Eli Lilly and Company, spoke at rstudio::conf 2020 about the importance of interoperability in R.

## Learn More About Interoperability

In future posts, we will expand on this idea of interoperability, with a particular focus on teams using R and Python together and on how open source data science can complement BI tools.

If you’d like to learn more about interoperability, we recommend these resources: