Reveal to Debut SAS to Python Translation Product at FedCASIC 2025

Reveal data scientists Cameron Milne, Ellie Mamantov, and John Lynagh will present a new SAS to Python product at FedCASIC on April 23, 2025. The product supports the U.S. Census Bureau’s migration of SAS programs to Python by performing automated translations more quickly and at lower cost than other code translation products on the market. Read more about Reveal’s work migrating SAS codebases to Python below.

Migrating SAS to Python: Reveal Accelerates Codebase Migration with a Homegrown Translation Product that Shrinks Translation Time and Costs

by Cameron Milne, Senior Data Scientist

Reveal Global Consulting developers designed a large language model (LLM) based product that scales automated translation of SAS to Python, supporting the U.S. Census Bureau’s modernization efforts while reducing costs. Statistical Analysis System (SAS) is a programming language with powerful data processing and statistical analysis capabilities that has powered the survey lifecycle at the Census Bureau for decades. SAS is a proprietary language that has become increasingly expensive compared with free, open-source alternatives like Python or R. As the Census Bureau migrates from on-premises servers to the cloud over the next few years, it seeks to refactor the codebases that run its most critical surveys to 1) save costs and 2) account for the long-term effects on recruiting, as fewer universities are prioritizing SAS within their curricula.

Code Translation Options

Historically, code translation has been a costly endeavor for organizations because it 1) takes developers’ time away from business-critical operations; 2) requires expertise in both the source and target languages; and 3) often involves extended periods of troubleshooting. For decades, translators relied on hand-crafted rewrite rules layered over Abstract Syntax Trees (ASTs), which were difficult to read and, in places, ineffective, forcing developers to spend much of their time performing manual corrections. The Commonwealth Bank of Australia, for example, relied on these strategies in 2012 for a project that ballooned to $750 million over five years. More recently, Meta’s AI Research showed that unsupervised learning could perform with more flexibility if trained on hundreds of thousands of examples of equivalent code across C++, Java, and Python. Reliance on massive parallel data for pre-training, however, isn’t viable for teams that cannot spare the months of dataset curation and allocation of talent for a strategy that might only yield small improvements.

LLMs offer a promising alternative that is faster, less expensive, and supportable by any developer with an understanding of the source and target languages. This article shows how Reveal designed a product that prioritizes speed and scalability, providing high-quality “first-pass” drafts that reduce the “last-mile” effort involved in making translations work.

Designing a Pipeline

The translation pipeline was designed under a few assumptions:

  • Model interoperability: we can swap LLMs with minimal refactoring

  • Scalability: SAS programs of varying size and complexity are translated automatically

  • Feedback-agnostic: client data cannot be exposed to the model, ruling out agent-driven systems that use feedback loops to improve after a first pass
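
The model interoperability assumption can be illustrated with a minimal sketch. The names below (TranslationModel, EchoModel, translate_chunk) are hypothetical and are not taken from Reveal's product; the point is only that the pipeline depends on a small interface, so a different LLM can be dropped in without touching the rest of the code.

```python
from dataclasses import dataclass
from typing import Protocol


class TranslationModel(Protocol):
    """Hypothetical interface: any backend with these members can be plugged in."""

    max_input_tokens: int

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend for local testing; it makes no network calls."""

    max_input_tokens: int = 8000

    def complete(self, prompt: str) -> str:
        return "# TODO: translate\n" + prompt


def translate_chunk(model: TranslationModel, sas_code: str) -> str:
    """Build the prompt in one place so every backend receives the same input."""
    prompt = "Translate this SAS code to Python:\n" + sas_code
    return model.complete(prompt)


draft = translate_chunk(EchoModel(), "data a; set b; run;")
print(draft)
```

Swapping models then means writing one new class with the same two members, while the parsing, chunking, and post-processing code stays unchanged.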

Moreover, using an LLM to generate text comes with a key limitation: a maximum number of input tokens (i.e., how much text an LLM can take in at once). This forced us to build a product that can intelligently break down SAS programs of variable length into smaller chunks while still producing coherent translations.
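
To make the token limit concrete, here is a rough sketch of the check that decides whether a program must be split. The four-characters-per-token ratio is only a common rule of thumb and an assumption here; real tokenizers differ by model, and the function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; an assumption, since real tokenizers vary by model."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(text: str, max_input_tokens: int, reserved: int = 0) -> bool:
    """True if the text plus reserved prompt overhead fits the input window."""
    return estimate_tokens(text) + reserved <= max_input_tokens


# A 100-statement program checked against a deliberately small window.
program = "data out; set inp; x = 1; run;\n" * 100
needs_chunking = not fits_in_window(program, max_input_tokens=500, reserved=100)
print(estimate_tokens(program), needs_chunking)
```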

Step 1: Parsing and Rule-Based Conversions

The first step is breaking a SAS program into a series of categories for later processing: single-line comments, multi-line comments, steps (i.e., data, proc), macros, and any additional loose code (formatting, options, etc.). The result can be stored in a JSON format for indexing. Once parsed, we perform several rule-based conversions where applicable. Array instantiations are typically repetitive in SAS programs, and LLMs, when given the choice, tend to abbreviate them into pseudocode, leaving developers with incomplete output. Custom functions that convert these arrays into Python arrays, along with a flag that excludes them from translation, allow us to avoid this model laziness. Similarly, dependencies in SAS programs (typically marked by “%let”) are flagged as “ignore” and translated separately later.

SAS to Python Step 1: Processing and Rule-based Conversions
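
The parsing step can be approximated with a small regular-expression scanner. This is a simplified sketch under stated assumptions, not Reveal's parser: it assumes every step ends in run; or quit;, it ignores semicolons inside string literals, and the convert_array helper only handles arrays with an explicit member list (no ranges such as q1-q3). All function and field names are hypothetical.

```python
import json
import re

# One named group per category; at any position the first alternative that matches wins.
SEGMENT = re.compile(
    r"""
    (?P<multi_line_comment>/\*.*?\*/)
  | (?P<macro>%macro\b.*?%mend\b[^;]*;)
  | (?P<step>\b(?:data|proc)\b.*?\b(?:run|quit)\s*;)
  | (?P<single_line_comment>^\s*\*[^;]*;)
  | (?P<loose>[^;\s][^;]*;)
    """,
    re.IGNORECASE | re.DOTALL | re.MULTILINE | re.VERBOSE,
)

ARRAY = re.compile(
    r"array\s+(\w+)\s*[{\[(]\s*(?:\d+|\*)\s*[}\])]\s+([^;]+);", re.IGNORECASE
)


def parse_sas(source: str) -> list[dict]:
    """Split a SAS program into categorized segments that can be stored as JSON."""
    segments = []
    for index, match in enumerate(SEGMENT.finditer(source)):
        code = match.group().strip()
        segments.append({
            "id": index,
            "category": match.lastgroup,
            "code": code,
            # %let dependencies are set aside and translated separately.
            "ignore": code.lower().startswith("%let"),
        })
    return segments


def convert_array(statement: str):
    """Rule-based conversion of a simple SAS array statement to a Python list."""
    match = ARRAY.fullmatch(statement.strip())
    if match is None:
        return None
    name, members = match.group(1), match.group(2).split()
    return f"{name} = {members!r}"


example = """\
/* Author: J. Doe */
%let year = 2024;
* load the input;
data work.a;
    set lib.b;
run;
proc sort data=work.a; by id; run;
"""

segments = parse_sas(example)
as_json = json.dumps(segments, indent=2)
array_line = convert_array("array scores{3} q1 q2 q3;")
print(as_json)
print(array_line)
```

Because the array conversion is deterministic, its output can be flagged to bypass the model entirely, which is one way to sidestep the abbreviated pseudocode problem described above.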

Step 2: Chunking

After parsing, “steps” are identified and chunked separately. Data or procedure (“proc”) steps can range from a couple of lines of code to thousands, so breaking the larger steps into digestible chunks for translation is crucial. Where a chunk begins and ends depends on what the code is doing. For example, we might separate macros, if-then logic (especially when nested), loops, and more. We also experiment with token counting strategies (fixed, normalized) to balance context. The goal is to keep code that is likely related together within the same chunk.

Reveal SAS to Python Step 2: Chunking
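
One way to picture structure-aware chunking is the sketch below. It is a simplification under assumptions: statements are split on semicolons (string literals containing semicolons are not handled), a chunk is never closed inside a do ... end block or between an if and its else, and a fixed statement count stands in for the token-based thresholds mentioned above. The function names are hypothetical.

```python
import re

OPENS = re.compile(r"\bdo\b", re.IGNORECASE)    # "if ... then do;", "do i = 1 to n;"
CLOSES = re.compile(r"^end\b", re.IGNORECASE)   # "end;"
ELSE = re.compile(r"^else\b", re.IGNORECASE)    # "else ...;"


def split_statements(code: str) -> list[str]:
    """Split SAS code into statements on semicolons (a simplification)."""
    return [part.strip() + ";" for part in code.split(";") if part.strip()]


def chunk_step(code: str, max_statements: int = 5) -> list[str]:
    """Close a chunk only outside do/end blocks and never right before an else."""
    statements = split_statements(code)
    chunks, current, depth = [], [], 0
    for index, statement in enumerate(statements):
        current.append(statement)
        if CLOSES.match(statement):
            depth = max(depth - 1, 0)
        elif OPENS.search(statement):
            depth += 1
        next_is_else = index + 1 < len(statements) and bool(
            ELSE.match(statements[index + 1])
        )
        if depth == 0 and len(current) >= max_statements and not next_is_else:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks


step = """
data out;
    set inp;
    if age > 65 then do;
        senior = 1;
        group = 'A';
    end;
    else do;
        senior = 0;
    end;
    total = a + b;
run;
"""

chunks = chunk_step(step, max_statements=3)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---\n{chunk}")
```

In this example the if/else logic stays in a single chunk even though it exceeds the three-statement threshold, which reflects the goal of keeping related code together.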

Step 3: Grouping

Grouping focuses on bundling chunks from the previous layer into larger groups that increase context for the model while remaining comfortably below the maximum token window. The rationale is that more context leads to higher quality output. Considerations include keeping groups within the boundaries of a larger step and pairing comments with the code that follows, as a comment might explain a concept that is more helpful for language adaptation than the code itself.

Reveal SAS to Python Step 3: Grouping
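
Grouping can be sketched as greedy packing under a token budget. The example below is illustrative only: the Chunk fields, the rough token estimate, and the rule that a unit larger than the budget simply gets its own group are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    step_id: int   # which parsed step the chunk came from
    kind: str      # "comment" or "code"
    text: str


def estimate_tokens(text: str) -> int:
    """Rough estimate (about four characters per token); an assumption."""
    return max(1, len(text) // 4)


def pair_comments(chunks: list[Chunk]) -> list[list[Chunk]]:
    """Attach each run of comments to the code chunk that follows it."""
    units, pending = [], []
    for chunk in chunks:
        pending.append(chunk)
        if chunk.kind != "comment":
            units.append(pending)
            pending = []
    if pending:  # trailing comments with no code after them
        units.append(pending)
    return units


def group_chunks(chunks: list[Chunk], budget: int) -> list[list[Chunk]]:
    """Greedily pack units into groups without crossing steps or the budget."""
    groups, current, used, step = [], [], 0, None
    for unit in pair_comments(chunks):
        cost = sum(estimate_tokens(c.text) for c in unit)
        unit_step = unit[-1].step_id
        if current and (used + cost > budget or unit_step != step):
            groups.append(current)
            current, used = [], 0
        current.extend(unit)
        used += cost
        step = unit_step
    if current:
        groups.append(current)
    return groups


chunks = [
    Chunk(0, "comment", "/* Flag seniors */"),
    Chunk(0, "code", "if age > 65 then senior = 1;"),
    Chunk(0, "code", "total = a + b;"),
    Chunk(1, "code", "proc sort data=out; by id; run;"),
]
groups = group_chunks(chunks, budget=12)
for group in groups:
    print([c.text for c in group])
```

Here the comment travels with the statement it describes, and the proc step starts a new group even though budget remains, so no group spans two steps.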

Step 4: Translation-Documentation Loop

The most challenging part of the pipeline is carrying information from earlier translations forward into later ones. When chunks are translated separately, the LLM will reproduce information from previous translations, leading to repetitive code or, worse, produce code that doesn’t build on what was previously instantiated. A second LLM tasked with reading translation outputs and producing documentation solves this problem. The documentation lists variables (file paths, environment variables, datasets, etc.) along with a description of how each is used. Relevant variables are passed to the prompt along with the next chunk to improve coherence across translations.

Reveal SAS to Python Step 4: Translation-Documentation Loop
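
The loop can be sketched with two pluggable callables, one that translates and one that documents. Everything below is a hypothetical illustration: the stub functions stand in for real LLM calls, and relevance is decided by a simple name match against the upcoming SAS code, which assumes variable names carry over from SAS to Python.

```python
import re
from typing import Callable


def relevant_notes(notes: dict[str, str], sas_group: str) -> dict[str, str]:
    """Keep only notes whose variable name appears in the upcoming SAS code."""
    words = set(re.findall(r"\w+", sas_group.lower()))
    return {name: text for name, text in notes.items() if name.lower() in words}


def build_prompt(sas_group: str, notes: dict[str, str]) -> str:
    lines = ["Translate the SAS code below to Python."]
    if notes:
        lines.append("Already defined earlier (reuse, do not redefine):")
        lines.extend(f"- {name}: {text}" for name, text in notes.items())
    lines.extend(["SAS code:", sas_group])
    return "\n".join(lines)


def run_loop(
    groups: list[str],
    translate: Callable[[str], str],
    document: Callable[[str], dict[str, str]],
) -> tuple[list[str], dict[str, str]]:
    """Translate each group, then document the output for later prompts."""
    notes: dict[str, str] = {}
    outputs: list[str] = []
    for sas_group in groups:
        prompt = build_prompt(sas_group, relevant_notes(notes, sas_group))
        python_code = translate(prompt)
        outputs.append(python_code)
        notes.update(document(python_code))
    return outputs, notes


# Stubs standing in for the two models, so the sketch runs offline.
prompts_seen: list[str] = []


def fake_translate(prompt: str) -> str:
    prompts_seen.append(prompt)
    if len(prompts_seen) == 1:
        return "survey = pd.read_csv(path)"
    return "survey['senior'] = survey['age'] > 65"


def fake_document(python_code: str) -> dict[str, str]:
    names = re.findall(r"^(\w+)\s*=", python_code, re.MULTILINE)
    return {name: "assigned in an earlier chunk" for name in names}


sas_groups = [
    "data survey; set raw.input; run;",
    "data survey; set survey; senior = age > 65; run;",
]
outputs, notes = run_loop(sas_groups, fake_translate, fake_document)
print(prompts_seen[1])
```

The second prompt carries a note about survey, so the translator is told to reuse the existing variable instead of loading the data again.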

Step 5: Post-Processing

The translated code is then passed through a post-processing layer that includes several checks. Import commands are relocated to the top of the script, along with any documentation identified in the SAS program, such as who wrote and edited the program. We also prune semantic and syntax errors before aligning the code style with PEP 8. The result is a single, coherent script that concatenates the individual translations.

Reveal SAS to Python Step 5: Post-Processing
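
A few of these checks can be sketched with Python's standard ast module. This is an illustration under assumptions, not the product's post-processor: blocks that fail to parse are commented out and reported for review (one possible policy), imports sharing a line with other statements are not handled, and PEP 8 formatting is left to an external formatter. The header text and function name are hypothetical.

```python
import ast


def post_process(translations: list[str], header: str = ""):
    """Hoist and deduplicate imports, flag unparseable blocks, and concatenate."""
    imports, bodies, problems = [], [], []
    for index, block in enumerate(translations):
        try:
            tree = ast.parse(block)
        except SyntaxError as error:
            problems.append((index, error.msg))
            # Comment the draft out so the final script still parses.
            bodies.append("\n".join("# REVIEW: " + line for line in block.splitlines()))
            continue
        import_lines = set()
        for node in tree.body:
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                statement = ast.unparse(node)
                if statement not in imports:
                    imports.append(statement)
                import_lines.update(range(node.lineno, node.end_lineno + 1))
        kept = [
            line
            for number, line in enumerate(block.splitlines(), start=1)
            if number not in import_lines
        ]
        bodies.append("\n".join(kept).strip())
    sections = [header.strip(), "\n".join(imports)] + bodies
    script = "\n\n".join(section for section in sections if section) + "\n"
    return script, problems


translations = [
    "import pandas as pd\nsurvey = pd.read_csv('input.csv')",
    "import pandas as pd\nimport numpy as np\n"
    "survey['senior'] = np.where(survey['age'] > 65, 1, 0)",
    "survey.to_csv('out.csv'",  # unbalanced parenthesis, should be flagged
]
script, problems = post_process(translations, header="# Original SAS program by J. Doe")
print(script)
print(problems)
```

The returned list of problems gives developers a short worklist of chunks that still need manual attention.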

Moving Forward

Reveal’s SAS to Python translation product has already supported major surveys at the U.S. Census Bureau. The National Sample Survey of Registered Nurses (NSSRN) in the Demography directorate was successfully migrated to Python, with migrations actively in progress for a number of other surveys at the Bureau.

Interested in moving from proprietary code to an option that is newer, more flexible, and free? Contact the Reveal team to learn more about our SAS to Python translation services.
