We Had the Blueprint the Whole Time: Why Clinical Data Hasn’t Automated (Yet)

In clinical data workflows, the technical potential has always outpaced organizational readiness.
Clinical data standards like CDISC ODM give us everything we need to automate SDTM transformations, but the industry hasn’t taken full advantage. This post explores why, and how we can finally bridge that gap.
For over two decades, CDISC has provided us with deeply thoughtful metadata standards. The Operational Data Model (ODM) wasn’t just created to document what we did; it was designed to enable automation, consistency, and reuse. It gave us a structured, machine-readable way to describe the forms, fields, controlled terminology, and relationships that make up a clinical trial.
And yet, here we are: still copying SDTM specs into Excel, writing thousands of lines of SAS code by hand, and maintaining hardcoded mappings across studies that look nearly identical.
It’s not because we’re lazy. It’s because the people doing the work are under-resourced, overburdened, and deeply siloed, and the systems that support them were never designed for modern workflows.
But now there is an opportunity to change that.
NOTE: For definitions of any unfamiliar terms, see the Glossary of Acronyms at the end of this article.
The ODM: More Than Metadata — It’s Infrastructure
Let’s start with the basics.
The CDISC Operational Data Model (ODM-XML) is a standardized format for representing:
- Study design
- Case report forms
- Item definitions (with datatypes, labels, units, etc.)
- Codelists and value-level metadata
- Audit trails, versioning, and relationships
It’s verbose, yes, but it’s also deeply rich. If we treat it as the source of truth, it can do more than describe a study. It can become the foundation for:
- Automated SDTM scaffolding
- Study-agnostic data transformations
- Real-time metadata traceability
- Spec-to-code pipelines with overrides
- AI-assisted mapping and validation
ODM often goes underutilized because we’ve never been encouraged to treat it as a functional tool.
Example: What’s Actually in an ODM File?
Here’s a simplified snippet of what you might find in a real-world ODM file:
```xml
<ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
  <Question>Severity of Pain</Question>
  <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
</ItemDef>

<CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale">
  <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
</CodeList>
```
From just this, a system could:
- Infer the appropriate --STRESC and --STRESN logic
- Scaffold the QS domain’s QSSTRESC variable with the appropriate mapping
- Validate data values against the codelist (sketched just below)
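To make that last step concrete, here’s a minimal sketch in Python using only the standard library. It treats the snippet above as the single source of truth: the codelist lookup and the item-to-codelist link are read from the metadata, never retyped. One assumption to flag: real ODM files declare the CDISC XML namespace, which I’ve omitted here to keep the tag lookups short.

```python
import xml.etree.ElementTree as ET

# The snippet above, wrapped in a root element so it parses as one document.
# NOTE: real ODM files carry the CDISC namespace; omitted here for brevity.
ODM_SNIPPET = """<ODM>
<ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
  <Question>Severity of Pain</Question>
  <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
</ItemDef>
<CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale">
  <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
</CodeList>
</ODM>"""

root = ET.fromstring(ODM_SNIPPET)

# Codelist lookup: OID -> set of allowed coded values.
codelists = {
    cl.get("OID"): {item.get("CodedValue") for item in cl.findall("CodeListItem")}
    for cl in root.findall("CodeList")
}

# For each item, resolve its codelist reference and validate incoming values.
for item_def in root.findall("ItemDef"):
    ref = item_def.find("CodeListRef")
    allowed = codelists.get(ref.get("CodeListOID")) if ref is not None else None
    for value in ("MOD", "MODERATE"):  # "MODERATE" is a decode, not a coded value
        ok = allowed is None or value in allowed
        print(f"{item_def.get('OID')}: {value!r} -> {'valid' if ok else 'NOT in codelist'}")
```

Everything the validator needs (the allowed values, the link from item to codelist) comes straight from the metadata.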
Instead of using it directly, we often retype that same information into a spec spreadsheet, and then again into SAS code.
Why Haven’t We Used the Specs CDISC Gave Us?
This is the real question, and it deserves reflection.
Here are a few reasons I’ve observed:
1. EDC Vendors Own the Output
Most sponsors don’t generate ODM — they receive it from EDC platforms like Medidata or Oracle. That means:
- The structure is often proprietary or inconsistent
- There’s little transparency into how the ODM is generated
- There’s no incentive (yet) to clean or standardize it
2. Specs Are Still Handwritten
Even with ODM in place, many orgs still create and pass around Excel/Word-based specs for SDTM mapping. Why?
- It feels familiar
- Programmers are used to SAS, not metadata parsing
- Clinical teams often work in silos from engineers
3. ODM Is Treated as an Archive, Not a Tool
The industry tends to treat ODM as a recordkeeping format—something to store, not something to build with. It was always meant to be both.
What If We Used ODM as Intended?
By parsing ODM and pairing it with a reference model (e.g., SDTMIG or CDISC Library), we can:
- Automatically match ItemOIDs to SDTM targets (a sketch follows this list)
- Generate compliant scaffolding for standard domains
- Allow for lightweight custom overrides where needed
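Here’s a rough sketch of what that matching step could look like. To be clear about the assumptions: the SDTM_REFERENCE dictionary is a stand-in for a real reference model (in practice you’d pull it from the CDISC Library API or a local SDTMIG extract), and the OID naming convention is a common sponsor pattern, not something the ODM standard guarantees.

```python
import re

# Stand-in reference model: a few SDTMIG QS variables. In a real pipeline,
# this would be loaded from the CDISC Library or an SDTMIG metadata extract.
SDTM_REFERENCE = {
    ("QS", "QSORRES"): {"label": "Result or Finding in Original Units", "core": "Expected"},
    ("QS", "QSSTRESC"): {"label": "Character Result/Finding in Std Format", "core": "Expected"},
}

def match_item_to_sdtm(item_oid: str):
    """Match an ItemDef OID like 'IT.QS.QSSTRESC' to an SDTM (domain, variable) target.

    Assumes OIDs encode domain and variable, a naming convention that varies
    by EDC vendor. Anything that doesn't match falls through to manual
    mapping instead of being silently guessed.
    """
    m = re.fullmatch(r"IT\.([A-Z]{2})\.([A-Z0-9]+)", item_oid)
    if not m:
        return None
    domain, variable = m.groups()
    return (domain, variable) if (domain, variable) in SDTM_REFERENCE else None

print(match_item_to_sdtm("IT.QS.QSSTRESC"))  # ('QS', 'QSSTRESC')
print(match_item_to_sdtm("IT.XX.CUSTOM1"))   # None -> route to manual review
```

The design choice worth noting: unmatched items return None rather than a best guess, so custom fields surface for human review instead of landing in the wrong domain.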
What I’m Building
I’ve been working on a Blueprint-as-a-Service (BaaS) framework that:
- Parses ODM-JSON or ODM-XML
- Matches metadata to SDTM targets
- Generates scaffolding SQL using DBT (a toy example follows this list)
- Supports overrides for custom derivations
- Outputs traceable, auditable pipeline templates
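To show what “scaffolding SQL with overrides” might mean in practice, here’s a toy renderer. Every specific in it is hypothetical (the source() names, the override registry, the derivation expression); the point is the shape: defaults generated from metadata, with study-specific logic layered on top rather than hand-edited into the generated file.

```python
# Hypothetical override registry: a study-specific derivation, keyed by
# (study, SDTM variable), that wins over the default passthrough mapping.
OVERRIDES = {
    ("STUDY123", "QSSTRESC"): "upper(trim(raw.severity))",
}

def render_dbt_model(study: str, domain: str, variable: str, source_column: str) -> str:
    """Render a minimal dbt model that scaffolds one SDTM variable from raw EDC data."""
    expr = OVERRIDES.get((study, variable), f"raw.{source_column}")
    return (
        f"-- models/sdtm/{domain.lower()}_{variable.lower()}.sql (generated)\n"
        f"select\n"
        f"    raw.usubjid,\n"
        f"    {expr} as {variable.lower()}\n"
        f"from {{{{ source('edc_raw', '{domain.lower()}') }}}} as raw\n"
    )

print(render_dbt_model("STUDY123", "QS", "QSSTRESC", "severity"))
```

Because the override lives in a registry rather than inside the generated file, every deviation from the default mapping is explicit, reviewable, and traceable back to a decision.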
My goal is to make SDTM programming modular, inspectable, and automatable, while still honoring the deep expertise of clinical teams and the nuances of individual studies.
I’ll be sharing more about the project soon. But for now, I want to invite others to reflect with me:
This isn’t about replacing people. It’s about respecting their time and expertise by removing busywork.
We don’t need to wait for vendors to catch up. We can start small. Build modularly. Use the standards we already have.
With some intention, we can close the gap between what’s possible and what actually gets built.
💬 If you’re enjoying the ideas here and want to stay connected, feel free to connect with me on LinkedIn. I’d love to stay in touch with others thinking about the future of clinical data and systems design.
Disclaimer: This article reflects my personal views only and is for informational purposes. It does not represent professional advice or the positions of any past or current employer. No confidential or proprietary information is shared, and I disclaim all liability for how you use its content. Third-party links or tool mentions are not endorsements.
Glossary of Acronyms
| Acronym | Definition |
|---|---|
| 21 CFR Part 11 | Title 21 of the U.S. Code of Federal Regulations, Part 11 (electronic records and signatures requirements for FDA-regulated industries) |
| ADaM | Analysis Data Model (CDISC standard for analysis datasets derived from SDTM) |
| AWS | Amazon Web Services (cloud provider used for S3, Glue, Lambda, Step Functions, Athena, etc.) |
| BaaS | Blueprint as a Service (reusable, versioned infrastructure templates for clinical-data workflows) |
| BAA | Business Associate Agreement (contract between a HIPAA-covered entity and its service provider to ensure protection of PHI) |
| CDISC | Clinical Data Interchange Standards Consortium (organization that defines clinical data standards such as SDTM) |
| CI/CD | Continuous Integration and Continuous Deployment (automated workflows for building, testing, and deploying code or infrastructure) |
| CRF | Case Report Form (the template used to collect and standardize clinical-trial data inputs) |
| CRO | Contract Research Organization (company providing outsourced research services to pharmaceutical and biotechnology industries) |
| CSV | Comma-Separated Values (a common, plain-text format for tabular data exchange, often used to ingest raw clinical data) |
| DBT | Data Build Tool (SQL-centric transformation framework often used downstream of raw-data ingestion) |
| DX | Developer Experience (the usability and productivity of tools and platforms from a developer’s perspective) |
| EDC | Electronic Data Capture (systems and processes for collecting clinical trial data electronically) |
| eCTD | electronic Common Technical Document (standardized format for submitting regulatory information to health authorities) |
| EMA | European Medicines Agency (EU regulatory agency responsible for the scientific evaluation, supervision, and safety monitoring of medicines) |
| EMR | Electronic Medical Record (digital version of a patient’s paper chart) |
| ETL | Extract, Transform, Load (data-processing pattern; often used interchangeably with “transform” in clinical pipelines) |
| FDA | Food and Drug Administration (U.S. regulatory agency overseeing drug and medical device approvals) |
| FTP | File Transfer Protocol (standard network protocol to transfer files between client and server on a computer network) |
| GxP | “Good x Practice” (umbrella term for Good Clinical Practice, Good Manufacturing Practice, etc.; regulatory frameworks requiring auditability and validation) |
| GUI | Graphical User Interface (visual interface allowing users to interact with software via graphical icons and indicators) |
| HIPAA | Health Insurance Portability and Accountability Act (U.S. law establishing standards to protect patients’ health information privacy) |
| IaC | Infrastructure as Code (e.g., Terraform or Pulumi modules used to provision cloud resources) |
| IAM | Identity and Access Management (AWS service for managing users, roles, and permissions across AWS resources) |
| IQ | Installation Qualification (verifies that equipment or systems are installed according to predefined requirements) |
| K8s | Kubernetes (open-source container orchestration system for automating application deployment, scaling, and management) |
| Lambda | AWS Lambda (serverless compute service that runs code in response to triggers) |
| ODM | Operational Data Model (CDISC standard for machine-readable study metadata, clinical data, and audit trails, serialized as XML or JSON) |
| OQ | Operational Qualification (demonstrates that installed equipment or systems operate within predetermined limits under intended conditions) |
| Pinnacle 21 | A widely used validation tool (successor to the open-source OpenCDISC validator) for checking datasets against CDISC conformance rules |
| QC | Quality Control (processes and checks to ensure data accuracy and integrity before analysis or submission) |
| R | R (programming language and environment for statistical computing and graphics) |
| RBAC | Role-Based Access Control (authorization paradigm that restricts system access to users based on assigned roles and permissions) |
| ROI | Return on Investment (measure of the gain or loss generated relative to the resources invested) |
| S3 | Amazon Simple Storage Service (object storage used as “raw” and “curated” buckets) |
| SAS | Statistical Analysis System (software suite widely used for data management and analysis in biostatistics and clinical trials) |
| SCEs | Statistical Computing Environments (platforms and tools where biostatisticians and programmers analyze clinical data) |
| SDTM | Study Data Tabulation Model (the CDISC standard for clinical-trial submission data) |
| SDTMIG | SDTM Implementation Guide (CDISC guidance on implementing SDTM domains and variables) |
| SFTP | SSH File Transfer Protocol (network protocol that provides secure file transfer and manipulation functionality over SSH) |
| SIEM | Security Information and Event Management (systems that provide real-time analysis of security alerts generated by network hardware and applications) |
| SOP | Standard Operating Procedure (detailed, written instructions to achieve uniformity in the performance of a specific function) |
| TCO | Total Cost of Ownership (comprehensive assessment of direct and indirect costs related to the purchase and operation of an asset) |
| VPC | Virtual Private Cloud (isolated section of a cloud provider’s network where you can launch resources in a virtual network you define) |
| VPN | Virtual Private Network (encrypted connection over the internet that provides secure access to a private network) |