We Had the Blueprint the Whole Time: Why Clinical Data Hasn’t Automated (Yet)

In clinical data workflows, the technical potential has always outpaced organizational readiness.
Clinical data standards like CDISC ODM give us everything we need to automate SDTM transformations, but the industry hasn’t taken full advantage. This post explores why, and how we can finally bridge that gap.
For over two decades, CDISC has provided us with deeply thoughtful metadata standards. The Operational Data Model (ODM) wasn’t just created to document what we did; it was designed to enable automation, consistency, and reuse. It gave us a structured, machine-readable way to describe the forms, fields, controlled terminology, and relationships that make up a clinical trial.
And yet, here we are: still copying SDTM specs into Excel, writing thousands of lines of SAS code by hand, and maintaining hardcoded mappings across studies that look nearly identical.
It’s not because we’re lazy. It’s because the people doing the work are under-resourced, overburdened, and deeply siloed, and the systems that support them were never designed for modern workflows.
But now there is an opportunity to change that.
NOTE: For definitions of any unfamiliar terms, see the Glossary of Acronyms at the end of this article.
The ODM: More Than Metadata — It’s Infrastructure
Let’s start with the basics.
The CDISC Operational Data Model (ODM-XML) is a standardized format for representing:
- Study design
- Case report forms
- Item definitions (with datatypes, labels, units, etc.)
- Codelists and value-level metadata
- Audit trails, versioning, and relationships
It’s verbose, yes, but it’s also deeply rich. If we treat it as the source of truth, it can do more than describe a study. It can become the foundation for:
- Automated SDTM scaffolding
- Study-agnostic data transformations
- Real-time metadata traceability
- Spec-to-code pipelines with overrides
- AI-assisted mapping and validation
ODM often goes underutilized because we’ve never been encouraged to treat it as a functional tool.
Example: What’s Actually in an ODM File?
Here’s a simplified snippet of what you might find in a real-world ODM file:
```xml
<ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
  <Question>Severity of Pain</Question>
  <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
</ItemDef>

<CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale">
  <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
</CodeList>
```
From just this, a system could:
- Infer the appropriate --STRESC and --STRESN logic
- Scaffold the QS domain’s QSSTRESC variable with the appropriate mapping
- Validate data values against the codelist (sketched just below)
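To make that last step concrete, here’s a minimal sketch in Python using only the standard library. It treats the snippet above as the single source of truth: the codelist lookup and the item-to-codelist link are read from the metadata, never retyped. One assumption to flag: real ODM files declare the CDISC XML namespace, which I’ve omitted here to keep the tag lookups short.

```python
import xml.etree.ElementTree as ET

# The snippet above, wrapped in a root element so it parses as one document.
# NOTE: real ODM files carry the CDISC namespace; omitted here for brevity.
ODM_SNIPPET = """<ODM>
<ItemDef OID="IT.QS.QSSTRESC" Name="Severity" DataType="text">
  <Question>Severity of Pain</Question>
  <CodeListRef CodeListOID="CL.SEVERITY_SCALE"/>
</ItemDef>
<CodeList OID="CL.SEVERITY_SCALE" Name="Severity Scale">
  <CodeListItem CodedValue="MILD"><Decode><TranslatedText>Mild</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="MOD"><Decode><TranslatedText>Moderate</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="SEV"><Decode><TranslatedText>Severe</TranslatedText></Decode></CodeListItem>
</CodeList>
</ODM>"""

root = ET.fromstring(ODM_SNIPPET)

# Codelist lookup: OID -> set of allowed coded values.
codelists = {
    cl.get("OID"): {item.get("CodedValue") for item in cl.findall("CodeListItem")}
    for cl in root.findall("CodeList")
}

# For each item, resolve its codelist reference and validate incoming values.
for item_def in root.findall("ItemDef"):
    ref = item_def.find("CodeListRef")
    allowed = codelists.get(ref.get("CodeListOID")) if ref is not None else None
    for value in ("MOD", "MODERATE"):  # "MODERATE" is a decode, not a coded value
        ok = allowed is None or value in allowed
        print(f"{item_def.get('OID')}: {value!r} -> {'valid' if ok else 'NOT in codelist'}")
```

Everything the validator needs (the allowed values, the link from item to codelist) comes straight from the metadata.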
Instead of using it directly, we often retype that same information into a spec spreadsheet, and then again into SAS code.
Why Haven’t We Used the Specs CDISC Gave Us?
This is the real question, and it deserves reflection.
Here are a few reasons I’ve observed:
1. EDC Vendors Own the Output
Most sponsors don’t generate ODM — they receive it from EDC platforms like Medidata or Oracle. That means:
- The structure is often proprietary or inconsistent
- There’s little transparency into how the ODM is generated
- There’s no incentive (yet) to clean or standardize it
2. Specs Are Still Handwritten
Even with ODM in place, many orgs still create and pass around Excel/Word-based specs for SDTM mapping. Why?
- It feels familiar
- Programmers are used to SAS, not metadata parsing
- Clinical teams often work in silos from engineers
3. ODM Is Treated as an Archive, Not a Tool
The industry tends to treat ODM as a recordkeeping format—something to store, not something to build with. It was always meant to be both.
What If We Used ODM as Intended?
By parsing ODM and pairing it with a reference model (e.g., SDTMIG or CDISC Library), we can:
- Automatically match ItemOIDs to SDTM targets (a sketch follows this list)
- Generate compliant scaffolding for standard domains
- Allow for lightweight custom overrides where needed
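Here’s a rough sketch of what that matching step could look like. To be clear about the assumptions: the SDTM_REFERENCE dictionary is a stand-in for a real reference model (in practice you’d pull it from the CDISC Library API or a local SDTMIG extract), and the OID naming convention is a common sponsor pattern, not something the ODM standard guarantees.

```python
import re

# Stand-in reference model: a few SDTMIG QS variables. In a real pipeline,
# this would be loaded from the CDISC Library or an SDTMIG metadata extract.
SDTM_REFERENCE = {
    ("QS", "QSORRES"): {"label": "Result or Finding in Original Units", "core": "Expected"},
    ("QS", "QSSTRESC"): {"label": "Character Result/Finding in Std Format", "core": "Expected"},
}

def match_item_to_sdtm(item_oid: str):
    """Match an ItemDef OID like 'IT.QS.QSSTRESC' to an SDTM (domain, variable) target.

    Assumes OIDs encode domain and variable, a naming convention that varies
    by EDC vendor. Anything that doesn't match falls through to manual
    mapping instead of being silently guessed.
    """
    m = re.fullmatch(r"IT\.([A-Z]{2})\.([A-Z0-9]+)", item_oid)
    if not m:
        return None
    domain, variable = m.groups()
    return (domain, variable) if (domain, variable) in SDTM_REFERENCE else None

print(match_item_to_sdtm("IT.QS.QSSTRESC"))  # ('QS', 'QSSTRESC')
print(match_item_to_sdtm("IT.XX.CUSTOM1"))   # None -> route to manual review
```

The design choice worth noting: unmatched items return None rather than a best guess, so custom fields surface for human review instead of landing in the wrong domain.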
What I’m Building
I’ve been working on a Blueprint-as-a-Service (BaaS) framework that:
- Parses ODM-JSON or ODM-XML
- Matches metadata to SDTM targets
- Generates scaffolding SQL using DBT (a toy example follows this list)
- Supports overrides for custom derivations
- Outputs traceable, auditable pipeline templates
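To show what “scaffolding SQL with overrides” might mean in practice, here’s a toy renderer. Every specific in it is hypothetical (the source() names, the override registry, the derivation expression); the point is the shape: defaults generated from metadata, with study-specific logic layered on top rather than hand-edited into the generated file.

```python
# Hypothetical override registry: a study-specific derivation, keyed by
# (study, SDTM variable), that wins over the default passthrough mapping.
OVERRIDES = {
    ("STUDY123", "QSSTRESC"): "upper(trim(raw.severity))",
}

def render_dbt_model(study: str, domain: str, variable: str, source_column: str) -> str:
    """Render a minimal dbt model that scaffolds one SDTM variable from raw EDC data."""
    expr = OVERRIDES.get((study, variable), f"raw.{source_column}")
    return (
        f"-- models/sdtm/{domain.lower()}_{variable.lower()}.sql (generated)\n"
        f"select\n"
        f"    raw.usubjid,\n"
        f"    {expr} as {variable.lower()}\n"
        f"from {{{{ source('edc_raw', '{domain.lower()}') }}}} as raw\n"
    )

print(render_dbt_model("STUDY123", "QS", "QSSTRESC", "severity"))
```

Because the override lives in a registry rather than inside the generated file, every deviation from the default mapping is explicit, reviewable, and traceable back to a decision.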
My goal is to make SDTM programming modular, inspectable, and automatable, while still honoring the deep expertise of clinical teams and the nuances of individual studies.
I’ll be sharing more about the project soon. But for now, I want to invite others to reflect with me:
This isn’t about replacing people. It’s about respecting their time and expertise by removing busywork.
We don’t need to wait for vendors to catch up. We can start small. Build modularly. Use the standards we already have.
With some intention, we can close the gap between what’s possible and what actually gets built.
💬 If you’re enjoying the ideas here and want to stay connected, feel free to connect with me on LinkedIn. I’d love to stay in touch with others thinking about the future of clinical data and systems design.
Disclaimer: This article reflects my personal views only and is for informational purposes. It does not represent professional advice or the positions of any past or current employer. No confidential or proprietary information is shared, and I disclaim all liability for how you use its content. Third-party links or tool mentions are not endorsements.
Glossary of Acronyms
| Acronym | Definition |
|---|---|
| 21 CFR Part 11 | Title 21 of the U.S. Code of Federal Regulations, Part 11 (electronic records and signatures requirements for FDA-regulated industries) |
| ADaM | Analysis Data Model (CDISC standard for analysis datasets derived from SDTM) |
| AWS | Amazon Web Services (cloud provider used for S3, Glue, Lambda, Step Functions, Athena, etc.) |
| BaaS | Blueprint as a Service (reusable, versioned infrastructure templates for clinical-data workflows) |
| BAA | Business Associate Agreement (contract between a HIPAA-covered entity and its service provider to ensure protection of PHI) |
| CDISC | Clinical Data Interchange Standards Consortium (organization that defines clinical data standards such as SDTM) |
| CI/CD | Continuous Integration and Continuous Deployment (automated workflows for building, testing, and deploying code or infrastructure) |
| CRF | Case Report Form (the template used to collect and standardize clinical-trial data inputs) |
| CRO | Contract Research Organization (company providing outsourced research services to pharmaceutical and biotechnology industries) |
| CSV | Comma-Separated Values (a common, plain-text format for tabular data exchange, often used to ingest raw clinical data) |
| DBT | Data Build Tool (SQL-centric transformation framework often used downstream of raw-data ingestion) |
| DX | Developer Experience (the usability and productivity of tools and platforms from a developer’s perspective) |
| EDC | Electronic Data Capture (systems and processes for collecting clinical trial data electronically) |
| eCTD | electronic Common Technical Document (standardized format for submitting regulatory information to health authorities) |
| EMA | European Medicines Agency (EU regulatory agency responsible for the scientific evaluation, supervision, and safety monitoring of medicines) |
| EMR | Electronic Medical Record (digital version of a patient’s paper chart) |
| ETL | Extract, Transform, Load (data-processing pattern; often used interchangeably with “transform” in clinical pipelines) |
| FDA | Food and Drug Administration (U.S. regulatory agency overseeing drug and medical device approvals) |
| FTP | File Transfer Protocol (standard network protocol to transfer files between client and server on a computer network) |
| GxP | “Good x Practice” (umbrella term for Good Clinical Practice, Good Manufacturing Practice, etc.; regulatory frameworks requiring auditability and validation) |
| GUI | Graphical User Interface (visual interface allowing users to interact with software via graphical icons and indicators) |
| HIPAA | Health Insurance Portability and Accountability Act (U.S. law establishing standards to protect patients’ health information privacy) |
| IaC | Infrastructure as Code (e.g., Terraform or Pulumi modules used to provision cloud resources) |
| IAM | Identity and Access Management (AWS service for managing users, roles, and permissions across AWS resources) |
| IQ | Installation Qualification (verifies that equipment or systems are installed according to predefined requirements) |
| K8s | Kubernetes (open-source container orchestration system for automating application deployment, scaling, and management) |
| Lambda | AWS Lambda (serverless compute service that runs code in response to triggers) |
| ODM | Operational Data Model (CDISC standard for machine-readable study metadata, clinical data, and audit trails, serialized as XML or JSON) |
| OQ | Operational Qualification (demonstrates that installed equipment or systems operate within predetermined limits under intended conditions) |
| Pinnacle 21 | A widely used validation tool (successor to the open-source OpenCDISC validator) for checking datasets against CDISC conformance rules |
| QC | Quality Control (processes and checks to ensure data accuracy and integrity before analysis or submission) |
| R | R (programming language and environment for statistical computing and graphics) |
| RBAC | Role-Based Access Control (authorization paradigm that restricts system access to users based on assigned roles and permissions) |
| ROI | Return on Investment (measure of the gain or loss generated relative to the resources invested) |
| S3 | Amazon Simple Storage Service (object storage used as “raw” and “curated” buckets) |
| SAS | Statistical Analysis System (software suite widely used for data management and analysis in biostatistics and clinical trials) |
| SCEs | Statistical Computing Environments (platforms and tools where biostatisticians and programmers analyze clinical data) |
| SDTM | Study Data Tabulation Model (the CDISC standard for clinical-trial submission data) |
| SDTMIG | SDTM Implementation Guide (CDISC guidance on implementing SDTM domains and variables) |
| SFTP | SSH File Transfer Protocol (network protocol that provides secure file transfer and manipulation functionality over SSH) |
| SIEM | Security Information and Event Management (systems that provide real-time analysis of security alerts generated by network hardware and applications) |
| SOP | Standard Operating Procedure (detailed, written instructions to achieve uniformity in the performance of a specific function) |
| TCO | Total Cost of Ownership (comprehensive assessment of direct and indirect costs related to the purchase and operation of an asset) |
| VPC | Virtual Private Cloud (isolated section of a cloud provider’s network where you can launch resources in a virtual network you define) |
| VPN | Virtual Private Network (encrypted connection over the internet that provides secure access to a private network) |