State Transportation Agency
AI-Powered Data Discovery Gives State Transportation Agency a Clear Migration Path
A state transportation agency faced a familiar challenge – critical operational data was locked inside three siloed construction management systems with no centralized platform or governance in place. Allata conducted a six-week, AI-enabled data discovery engagement on AWS managed services that focused on profiling over 550 tables to build a working proof of concept (POC) with SageMaker Unified Studio and generative AI tooling, to deliver a production-ready roadmap. The result: a validated architectural blueprint, prioritized data products, and a clear four-month path to a governed data lakehouse that positions the agency for faster migrations and smarter decisions.
OVERVIEW
Our client needed to understand what lived inside its legacy databases before it could modernize. Allata delivered a complete data discovery and POC that cataloged three mission-critical systems, surfaced data quality risks, and demonstrated AI-driven exploration, all within six weeks.
The engagement produced working infrastructure on AWS that included ingestion pipelines, governed storage, an AI chatbot for schema introspection, and natural-language querying against newly cataloged data. This gave the agency tangible evidence that a modern data platform was both achievable and practical.
- Profiled 550+ tables across three legacy SQL Server systems; identified fewer than 40 core fact tables driving most of the reporting value.
- Delivered a validated proof of concept covering ingestion, governance, cataloging, and AI-assisted consumption.
- Produced a four-month implementation roadmap with defined deliverables, team structure, and success criteria.
THE CHALLENGE
The agency operated three independent construction management applications. Each application was built on its own SQL server, with its own data model, security approach, and integration patterns. Together, these systems held decades of project, maintenance, and materials data, yet no centralized platform existed to access, govern, or analyze that information holistically.
Across the three systems, stored procedures numbered in the hundreds, data types drifted between databases (decimal precision varied from system to system), and delete logic was handled differently everywhere. One system used soft-delete flags, another used void dates, and a third relied on active/inactive character codes. Employee identifiers, org codes, and route numbers appeared in all three systems with no referential enforcement.
Allata’s client was planning to modernize two to four critical applications, but extracting and trusting the underlying data was a prerequisite that had no clear path. Semi-structured payloads, 28 XML-generating functions in one system alone, plus comma-delimited values stored in 14 columns added complexity to any future extraction effort.
Additionally, there was no data catalog, no automated quality checks, and no way to leverage AI tooling against its own information. Business users and analysts lacked self-service access to cross-system insights.
- Three siloed SQL Server systems with no centralized data platform or governance structure.
- Hundreds of stored procedures, inconsistent data types, and mixed delete semantics across databases.
- 28 XML-generating functions and 14 comma-delimited columns requiring shredding before any migration.
- Application modernization was blocked by the inability to assess, clean, and unify source data.
OUR SOLUTION
Allata began by cataloging and deeply profiling each of the three legacy databases, meeting with system owners, mapping data flows, documenting integration points, and identifying pain points. The team then evaluated AWS managed services to determine the right combination for the agency’s go-forward data platform, with particular attention to AI-assisted data understanding.
Using the discovery findings, Allata built a functional POC on AWS that validated six core platform capabilities end to end: ingestion via DMS, governed storage through Lake Formation, automated catalog refreshes, a SageMaker Unified Studio domain with Bedrock-powered AI, natural-language data exploration, and curated asset publishing through Visual Flow Builder and Jupyter Notebooks.
- Ingested legacy data from all three systems into S3 as Parquet using AWS Database Migration Service in a pure ELT pattern
- Established governed storage with a CDK-deployed stack provisioning Lake. Formation roles, Glue databases, crawlers, and data-level permissions per source
- Built a schema-aware AI chatbot in SageMaker Unified Studio backed by an S3 knowledge base containing each database’s DDL statements
- Proved AI-driven querying using Amazon Q and Nova Premier to generate and refine Athena SQL against the newly cataloged data
- Delivered a prioritized roadmap identifying five high-value data products, five key shared dimensions, and five critical data quality issues to resolve before migration.
THE RESULT
The six-week engagement confirmed that the data, while complex, was predictable and manageable. Every architectural assumption was tested against real data, and demonstrated that generative AI could accelerate data prioritization, gap identification, and pipeline development on the platform.
Allata delivered a four-month implementation roadmap organized into four stages (Foundation, Governed Ingestion, Processed Silver Tables, Automated Quality and Audit), with a four-person core team, specific deliverables, and measurable success criteria for each stage.
- Identified that fewer than 40 core tables out of 550+ drive 95% of the agency’s reporting workload—focusing migration on a small, high-impact subset.
- Confirmed that generative AI auto-generated catalog descriptions as a quick win to improve data understanding across the organization.
- Found that Anthropic models outperformed AWS-native LLMs for schema introspection and SQL generation tasks.
- Established that natural-language querying via Athena works effectively once data is cataloged and accessible, removing SQL expertise as a barrier.
- Produced a production-ready Infrastructure-as-Code repository pattern with additive CDK stacks, ensuring core resources are never replaced and rollbacks remain safe.
technology
Allata leveraged AWS-native managed services throughout the engagement, selected for lower operational overhead, proven scalability, and alignment with the agency’s existing cloud footprint.
- AWS SageMaker Unified Studio — centralized interface for data governance, AI, and analytics projects
- Amazon Bedrock (Anthropic models, Nova Premier) — generative AI for schema introspection chatbot and SQL generation
- Amazon Q — AI-assisted code writing, SQL generation, and optimization within JupyterLab and Athena
- AWS Database Migration Service (DMS) — full-load and CDC ingestion from SQL Server to S3 (Parquet)
- AWS Lake Formation — tag-based fine-grained access control and governance
- AWS Glue (Crawlers, ETL, Data Quality) — automated cataloging, transformation pipelines, and quality rules
- Amazon Athena — serverless SQL querying over cataloged S3 data
- AWS CDK (Infrastructure-as-Code) — repeatable, version-controlled deployment of all platform stacks
- Amazon S3 (Iceberg/Parquet) — lakehouse storage with bronze and silver data layers
Innovation starts with a conversation.
Fill out this email form and we’ll connect you with the right person for your needs.
Related Case Studies
AI-Assisted Reports Review Reduced Manual Effort by 90%
- Energy & Utilities
- Artificial Intelligence
Energy & Utilities, Artificial Intelligence
AI-Powered Waste Profiling Cuts Rework by Half and Accelerates Approvals for a Leading National Environmental Services Provider
- Energy & Utilities
- Artificial Intelligence
Energy & Utilities, Artificial Intelligence
Leader in Modular Space Builds AI Foundation in Eight Weeks Enabling Enterprise-Wide Adoption
- Real Estate & Construction
- Artificial Intelligence
Real Estate & Construction, Artificial Intelligence