This article explores why the CDISC Operational Data Model (ODM) is underutilized, and how it could become the missing infrastructure layer in clinical data automation.

Image created by the author

In clinical data workflows, the technical potential has always outpaced organizational readiness.

Clinical data standards like CDISC ODM give us everything we need to automate SDTM transformations, but the industry hasn’t taken full advantage. This post explores why, and how we can finally bridge that gap.

For over two decades, CDISC has provided us with deeply thoughtful metadata standards. The Operational Data Model (ODM) wasn’t just created to document what we did; it was designed to enable automation, consistency, and reuse. It gave us a structured, machine-readable way to describe the forms, fields, controlled terminology, and relationships that make up a clinical trial.

And yet, here we are: still copying SDTM specs into Excel, writing thousands of lines of SAS code by hand, and maintaining hardcoded mappings across studies that look nearly identical.

It’s not because we’re lazy. It’s because the people doing the work are under-resourced, overburdened, and deeply siloed, and the systems that support them were never designed for modern workflows.

But now there is an opportunity to change that.

NOTE: For definitions of any unfamiliar terms, see the Glossary of Acronyms at the end of this article.

The ODM: More Than Metadata — It’s Infrastructure

Let’s start with the basics.

The CDISC Operational Data Model (ODM-XML) is a standardized format for representing:

  • Study design.
  • Case report forms.
  • Item definitions (with datatypes, labels, units, etc.).
  • Codelists and value-level metadata.
  • Audit trails, versioning, and relationships.

It’s verbose, yes, but it’s also deeply rich, and if we treat it as the source of truth, it can do more than describe a study. It can become the foundation for:

  • Automated SDTM scaffolding.
  • Study-agnostic data transformations.
  • Real-time metadata traceability.
  • Spec-to-code pipelines with overrides.
  • AI-assisted mapping and validation.

ODM often goes underutilized because we’ve never been encouraged to treat it as a functional tool.

Example: What’s Actually in an ODM File?

Here’s a simplified snippet of what you might find in a real-world ODM file:

<ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
  <Question><TranslatedText>Severity of Pain</TranslatedText></Question>
  <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
</ItemDef>

<CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale" DataType="text">
  <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
</CodeList>

From just this, a system could:

  • Infer the appropriate STRESC and STRESN logic
  • Scaffold the QS domain’s QSSTRESC variable with appropriate mapping
  • Validate data values against the codelist
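
To make that concrete, here is a minimal sketch of the validation step, using only the Python standard library. It is an illustration rather than production-grade ODM parsing: the snippet is wrapped in a single root element, and the namespaces and many attributes a real ODM file carries are omitted for readability.

import xml.etree.ElementTree as ET

# Simplified version of the snippet above; real ODM files are namespaced
# and carry many more attributes, which are trimmed here for clarity.
ODM_SNIPPET = """
<Study>
  <ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
    <Question><TranslatedText>Severity of Pain</TranslatedText></Question>
    <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
  </ItemDef>
  <CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale" DataType="text">
    <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
    <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
    <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
  </CodeList>
</Study>
"""

root = ET.fromstring(ODM_SNIPPET)

# Build a lookup of codelist OID -> set of permitted coded values.
codelists = {
    cl.get("OID"): {item.get("CodedValue") for item in cl.findall("CodeListItem")}
    for cl in root.findall("CodeList")
}

def validate(item_oid: str, value: str) -> bool:
    """Check a raw value against the codelist referenced by an ItemDef."""
    item = root.find(f"ItemDef[@OID='{item_oid}']")
    ref = item.find("CodeListRef")
    if ref is None:
        return True  # no codelist bound to this item
    return value in codelists[ref.get("CodeListOID")]

print(validate("IT.QS.QSSTRESC", "MOD"))      # True
print(validate("IT.QS.QSSTRESC", "EXTREME"))  # False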

Instead of using it directly, we often retype the same information into a spec spreadsheet, and then again into SAS code.

Why Haven’t We Used the Specs CDISC Gave Us?

This is the real question, and it deserves reflection.

Here are a few reasons I’ve observed:

1. EDC Vendors Own the Output

Most sponsors don’t generate ODM — they receive it from EDC platforms like Medidata or Oracle. That means:

  • The structure is often proprietary or inconsistent
  • There’s little transparency into how the ODM is generated
  • There’s no incentive (yet) to clean or standardize it

2. Specs Are Still Handwritten

Even with ODM in place, many orgs still create and pass around Excel/Word-based specs for SDTM mapping. Why?

  • It feels familiar
  • Programmers are used to SAS, not metadata parsing
  • Clinical teams often work in silos from engineers

3. ODM Is Treated as an Archive, Not a Tool

The industry tends to treat ODM as a recordkeeping format—something to store, not something to build with. It was always meant to be both.

What If We Used ODM as Intended?

By parsing ODM and pairing it with a reference model (e.g., SDTMIG or CDISC Library), we can:

  • Automatically match ItemOIDs to SDTM targets
  • Generate compliant scaffolding for standard domains
  • Allow for lightweight custom overrides where needed
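
Here is a hedged sketch of what that matching could look like. It assumes an IT.<DOMAIN>.<VARIABLE> naming convention for ItemOIDs and uses a hardcoded stub in place of real reference metadata; an actual pipeline would query the CDISC Library API or an SDTMIG extract instead.

# Stub standing in for SDTMIG / CDISC Library metadata.
REFERENCE_MODEL = {
    ("QS", "QSSTRESC"): {
        "label": "Character Result/Finding in Std Format",
        "core": "Exp",
    },
}

# Lightweight per-study overrides for items that break the OID convention.
OVERRIDES = {
    "IT.LEGACY.PAINSEV": ("QS", "QSSTRESC"),  # hypothetical legacy item
}

def match_item(item_oid: str):
    """Resolve an ODM ItemOID to an SDTM (domain, variable) target."""
    if item_oid in OVERRIDES:
        domain, variable = OVERRIDES[item_oid]
    else:
        # Assumes OIDs follow IT.<DOMAIN>.<VARIABLE>
        _, domain, variable = item_oid.split(".", 2)
    meta = REFERENCE_MODEL.get((domain, variable))
    if meta is None:
        raise KeyError(f"No SDTM target found for {item_oid}")
    return domain, variable, meta

print(match_item("IT.QS.QSSTRESC"))     # matched via the naming convention
print(match_item("IT.LEGACY.PAINSEV"))  # matched via an override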

What I’m Building

I’ve been working on a Blueprint-as-a-Service (BaaS) framework that:

  • Parses ODM-JSON or ODM-XML
  • Matches metadata to SDTM targets
  • Generates scaffolding SQL using DBT
  • Supports overrides for custom derivations
  • Outputs traceable, auditable pipeline templates
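
To give a flavor of the spec-to-code step, here is a simplified sketch of rendering a DBT-style scaffolding model from matched metadata. The file layout, source names, column names, and SQL shape are illustrative assumptions, not the framework's actual output.

from pathlib import Path

def render_dbt_model(domain: str, variable: str,
                     source_column: str, codelist_oid: str) -> str:
    """Render a scaffolding DBT model; all names here are illustrative."""
    return f"""-- Generated scaffolding for {domain}.{variable}.
-- Customize via overrides, not by editing this file by hand.
select
    raw.subject_id as USUBJID,
    '{domain}' as DOMAIN,
    raw.{source_column} as {variable}  -- constrained by {codelist_oid}
from {{{{ source('edc', 'raw_{domain.lower()}') }}}} as raw
"""

sql = render_dbt_model("QS", "QSSTRESC", "pain_severity", "CL.SEVERITY_SCALE")
Path("models/sdtm").mkdir(parents=True, exist_ok=True)
Path("models/sdtm/qs_qsstresc.sql").write_text(sql)

Because the output is plain SQL in a version-controlled DBT project, the scaffolding stays inspectable and diffable, and study-specific overrides can live alongside it rather than inside opaque programs.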

My goal is to make SDTM programming modular, inspectable, and automatable, while still honoring the deep expertise of clinical teams and the nuances of individual studies.

I’ll be sharing more about the project soon. But for now, I want to invite others to reflect with me:

This isn’t about replacing people. It’s about respecting their time and expertise by removing busywork.

We don’t need to wait for vendors to catch up. We can start small. Build modularly. Use the standards we already have.

With some intention, we can close the gap between what’s possible and what actually gets built.

UPDATE: Project status and ongoing work are tracked here.

💬 If you’re enjoying the ideas here and want to stay connected, feel free to connect with me on LinkedIn. I’d love to stay in touch with others thinking about the future of clinical data and systems design.


Disclaimer: This article reflects my personal views only and is for informational purposes. It does not represent professional advice or the positions of any past or current employer. No confidential or proprietary information is shared, and I disclaim all liability for how you use its content. Third-party links or tool mentions are not endorsements.

Glossary of Acronyms

21 CFR Part 11: Title 21 of the U.S. Code of Federal Regulations, Part 11 (electronic records and signatures requirements for FDA-regulated industries)
ADaM: Analysis Data Model (CDISC standard for analysis datasets derived from SDTM)
AWS: Amazon Web Services (cloud provider used for S3, Glue, Lambda, Step Functions, Athena, etc.)
BAA: Business Associate Agreement (contract between a HIPAA-covered entity and its service provider to ensure protection of PHI)
BaaS: Blueprint as a Service (reusable, versioned infrastructure templates for clinical-data workflows)
CDISC: Clinical Data Interchange Standards Consortium (organization that defines clinical data standards such as SDTM)
CI/CD: Continuous Integration and Continuous Deployment (automated workflows for building, testing, and deploying code or infrastructure)
CRF: Case Report Form (the template used to collect and standardize clinical-trial data inputs)
CRO: Contract Research Organization (company providing outsourced research services to pharmaceutical and biotechnology industries)
CSV: Comma-Separated Values (a common, plain-text format for tabular data exchange, often used to ingest raw clinical data)
DBT: Data Build Tool (SQL-centric transformation framework often used downstream of raw-data ingestion)
DX: Developer Experience (the usability and productivity of tools and platforms from a developer's perspective)
EDC: Electronic Data Capture (systems and processes for collecting clinical trial data electronically)
eCTD: electronic Common Technical Document (standardized format for submitting regulatory information to health authorities)
EMA: European Medicines Agency (EU regulatory agency responsible for the scientific evaluation, supervision, and safety monitoring of medicines)
EMR: Electronic Medical Record (digital version of a patient's paper chart)
ETL: Extract, Transform, Load (data-processing pattern; often used interchangeably with "transform" in clinical pipelines)
FDA: Food and Drug Administration (U.S. regulatory agency overseeing drug and medical device approvals)
FTP: File Transfer Protocol (standard network protocol to transfer files between client and server on a computer network)
GxP: "Good x Practice" (e.g., Good Clinical Practice, Good Manufacturing Practice; regulatory frameworks requiring auditability and validation)
GUI: Graphical User Interface (visual interface allowing users to interact with software via graphical icons and indicators)
HIPAA: Health Insurance Portability and Accountability Act (U.S. law establishing standards to protect patients' health information privacy)
IaC: Infrastructure as Code (e.g., Terraform or Pulumi modules used to provision cloud resources)
IAM: Identity and Access Management (AWS service for managing users, roles, and permissions across AWS resources)
IQ: Installation Qualification (verifies that equipment or systems are installed according to predefined requirements)
K8s: Kubernetes (open-source container orchestration system for automating application deployment, scaling, and management)
Lambda: AWS Lambda (serverless compute service that runs code in response to triggers)
ODM: Operational Data Model (CDISC standard for machine-readable study metadata, forms, items, and codelists, exchanged as ODM-XML or ODM-JSON)
OQ: Operational Qualification (demonstrates that installed equipment or systems operate within predetermined limits under intended conditions)
Pinnacle 21: A widely used open-source validation suite for CDISC-compliant datasets
QC: Quality Control (processes and checks to ensure data accuracy and integrity before analysis or submission)
R: R (programming language and environment for statistical computing and graphics)
RBAC: Role-Based Access Control (authorization paradigm that restricts system access to users based on assigned roles and permissions)
ROI: Return on Investment (measure of the gain or loss generated relative to the resources invested)
S3: Amazon Simple Storage Service (object storage used as "raw" and "curated" buckets)
SAS: Statistical Analysis System (software suite widely used for data management and analysis in biostatistics and clinical trials)
SDTM: Study Data Tabulation Model (the CDISC standard for clinical-trial submission data)
SDTMIG: SDTM Implementation Guide (CDISC guidance describing how SDTM domains and variables are implemented)
SECs: Statistical Computing Environments (platforms and tools where biostatisticians and programmers analyze clinical data)
SFTP: Secure File Transfer Protocol (network protocol that provides file transfer and manipulation functionality over SSH)
SIEM: Security Information and Event Management (systems that provide real-time analysis of security alerts generated by network hardware and applications)
SOP: Standard Operating Procedure (detailed, written instructions to achieve uniformity of the performance of a specific function)
TCO: Total Cost of Ownership (comprehensive assessment of direct and indirect costs related to the purchase and operation of an asset)
VPC: Virtual Private Cloud (isolated section of a cloud provider's network where you can launch resources in a virtual network you define)
VPN: Virtual Private Network (encrypted connection over the internet that provides secure access to a private network)