As a data scientist, or as a leader of a data science team, you know the power and flexibility that open source data science delivers. However, if your team works within a typical enterprise, you compete for budget and executive mindshare with a wide variety of other analytic tools, including self-service BI and point-and-click data science tools. Navigating this landscape, and convincing others in your organization of the value of open source data science, can be difficult. In this blog post, we draw on our recent webinar on this topic to give you some talking points to use with your colleagues when tackling this challenge.
It is important to keep in mind that “code-first” does not mean “code only.” While code is often the right choice, most organizations need multiple tools so that the right one is available for the task at hand.
The Pitfalls of BI Tools and Codeless Data Science
There are multiple ways to approach any given analytic problem. At their core, the various data science and BI tools have a lot in common. They all provide a way to draw on data from multiple sources and to explore, visualize, and understand that data in open-ended ways. Many tools support some way of creating applications and dashboards that can be shared with others to improve their decision-making.
Since these very different approaches can end up delivering applications and dashboards that may (at first glance) appear very similar, the strengths and nuances of each approach can be obscured to decision makers, especially executive budget holders. This can lead to competition between the teams that use them.
However, when taking a codeless approach, it can be difficult to achieve some critical analytic best practices, and to answer some very common and important questions:
- Difficulty tracking changes and auditing work: When modifications and additions are obscured in a series of point-and-click steps, it can be very challenging to answer questions like:
  - Why did we make this decision in our analysis?
  - How long has this error gone unnoticed?
  - Who made this change?
- No single source of truth: Without a centralized way of sharing and storing analyses and reports, different versions and spreadsheets can proliferate, leading to questions like:
  - Is this the most recent [data, report, dashboard]?
  - Is the file labeled “sales-data 2020-12 final FINAL Apr 21 NR (4).xlsx” really the most recent version of the analysis?
  - Where do I find the [data, report, dashboard] I am looking for? And who do I have to email to get the right link?
- Difficult to extend and reproduce your work: When you are depending on a proprietary platform for your analysis, with the details hidden behind the point-and-click interface, you might face questions like:
  - What did our model say 6 months ago?
  - Can I apply this analysis to this new (slightly different) data/problem?
  - Are we actually meeting the relevant regulatory requirements?
  - Is our work truly portable? Will others be able to reproduce and confirm our results?
At best, wrestling with questions like these will distract an analytics team, burning precious time that could be spent on new, valuable analyses. At worst, stakeholders end up with inconsistent or even incorrect answers because the analysis is wrong, out of date, or not reproducible. This can fundamentally undermine the credibility of the analytics team. Either way, the team’s potential impact in supporting decision makers is greatly reduced.
The Benefits of Code-First Data Science
RStudio’s mission is to create free and open-source software for data science, because we fundamentally believe that this enhances the production and consumption of knowledge, and facilitates collaboration and reproducible research.
At the core of this mission is a focus on a code-first approach. Data scientists grapple every day with novel, complex, often vaguely-defined problems with potential value to their organization. Before the solution can be automated, someone needs to figure out how to solve it. These sorts of problems are most easily approached with code.
With Code, the answer is always yes!
- Flexible: With code, there are no black box constraints. You can access and combine all your data, and analyze and present it exactly as you need to.
- Iterative: With code, you can quickly make changes and updates in response to feedback, and then share those updates with your stakeholders.
- Reusable and extensible: With a code-first approach, you can tackle similar problems in the future by applying your existing code, and extend that to novel problems as circumstances change. This makes code a fundamental source of Intellectual Property in your organization.
- Inspectable: With code, coupled with version control systems like git, you can track what has changed, when, by whom, and why. This helps you discover when errors might have been introduced, and audit the analytic approach.
- Reproducible: When combined with environment and package management (such as the capabilities provided by RStudio Team), you can ensure that you will be able to rerun and verify your analyses. And since your data science is open source at its core, you can be confident that others will be able to rerun and reproduce your analysis, without being reliant on expensive proprietary tools.
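As a small illustration of the “reusable” and “reproducible” points above, here is a minimal sketch in Python (one of the languages this post mentions alongside R). The function names `mean_by_group` and `sample_rows`, the field names, and the sample figures are all hypothetical and exist only for this example.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Return the mean of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


def sample_rows(rows, k, seed):
    """Draw k rows using an explicit seed, so the same draw can be rerun."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Hypothetical sales records (illustrative data only).
sales = [
    {"region": "East", "revenue": 100.0},
    {"region": "East", "revenue": 140.0},
    {"region": "West", "revenue": 90.0},
]

# Reusable: the same function handles a new, slightly different problem.
tickets = [
    {"team": "A", "hours": 2.0},
    {"team": "B", "hours": 4.0},
    {"team": "B", "hours": 6.0},
]

sales_summary = mean_by_group(sales, "region", "revenue")
ticket_summary = mean_by_group(tickets, "team", "hours")

# Reproducible: the same seed yields the same sample on every run.
first_draw = sample_rows(sales, 2, seed=42)
second_draw = sample_rows(sales, 2, seed=42)
```

Because every step is written down, a colleague can rerun the script, read exactly what was computed, and apply `mean_by_group` to their own data without rebuilding the analysis by hand.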
| Codeless Problem | Code-First Solution |
|---|---|
| Difficulty tracking changes and auditing work | Code, coupled with version control systems like git, to track what changed, when, by whom, and why. Code can be logged when run for auditing and monitoring. |
| No single source of truth | Centralized tools to create a single source of truth for data, dashboards, and models. Version control to track multiple versions of code separately without creating conflicts. |
| Difficult to extend and reproduce work | Code enables reproducibility by explicitly recording every step taken. Open-source code can be deployed on many platforms, and is not dependent on proprietary tools. Code can be copied, pasted, and modified to address novel problems as circumstances change. |
| Black box constraints on how you analyze your data and present your insights | Access and combine all your data, and analyze and present it exactly as you need to, in the form of tailored dashboards and reports. Pull in new methods and build on other open source work without waiting for proprietary features to be added by vendors. |
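The idea that code can be logged when it runs can be sketched in a few lines. This is an illustrative Python example and does not describe how any RStudio product implements auditing. The names `run_logged`, `share_above`, and `audit_log` are hypothetical, and a real team would more likely write records to a file or database than to an in-memory list.

```python
import hashlib
import json
from datetime import datetime, timezone

audit_log = []  # hypothetical in-memory store; a file or table in practice


def run_logged(analysis, data, **params):
    """Run analysis(data, **params) and append a record of the run."""
    # Fingerprint the input so a later reader can tell which data was used.
    digest = hashlib.sha256(
        json.dumps(data, sort_keys=True).encode("utf-8")
    ).hexdigest()
    result = analysis(data, **params)
    audit_log.append({
        "analysis": analysis.__name__,
        "params": params,
        "data_sha256": digest,
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    })
    return result


def share_above(values, threshold):
    """Fraction of values strictly above threshold."""
    return sum(v > threshold for v in values) / len(values)


# Illustrative numbers only.
result = run_logged(share_above, [10, 20, 30, 40], threshold=25)
```

A record like this, kept alongside version-controlled code, is one way to answer questions such as “What did our model say 6 months ago?” without relying on anyone’s memory.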
Objections to Code-First Data Science
When discussing the benefits of a code-first approach within your organization, you may hear some common objections:
- “Coding is too hard!”: In truth, it’s never been easier to learn data science with R. RStudio is dedicated to the proposition that code-first data science is uniquely powerful, and that everyone can learn to code. We support this through our education efforts, our Community site, and making R easier to learn and use through our open source projects such as the tidyverse.
- “Does code-first mean only code?”: Absolutely not. It’s about choosing the right tool for the job, which is why RStudio focuses on interoperability with the other analytic frameworks in your organization, supporting Python alongside R, and working closely with BI tools to reach the widest possible range of users.
- “But R doesn’t provide the enterprise features and infrastructure we need!”: Not true. RStudio’s professional product suite, RStudio Team, provides security, scalability, package management and centralized administration of development and deployment environments, delivering the enterprise features many organizations require. Our hosted offerings, RStudio Cloud and Shinyapps.io, enable data scientists to develop and deploy data products on the cloud, without managing their own infrastructure.
To Learn More
If you’d like to learn more about the advantages of code-first data science, and see some real examples in action, watch the free, on-demand webinar Why Your Enterprise Needs Code-First Data Science. Or, you can set up a meeting directly with our Customer Success team, to get your questions answered and learn how RStudio can help you get the most out of your data science.