Engineering Log

Things I built.
Things that broke.

Real notes from building in public — debugging sessions, architecture decisions, and observations from shipping production systems.

For my real estate marketplace project, Kiray, my rule is strict type safety. The backend uses a NestJS and TypeScript architecture with explicit types everywhere. But when I started building the background image processing using AWS Lambda, my stack strategy hit a major real-world hurdle.

To handle high-speed image resizing and compression, I chose sharp. Sharp is fast because it delegates the heavy lifting to libvips, a native image-processing library written in C, rather than doing the work in pure JavaScript.

However, because sharp depends on native code, npm install fetches a binary built specifically for the operating system and CPU architecture of the machine you install on. When I installed it on my Mac, npm downloaded the macOS binary. But when that code ran on AWS Lambda (which runs Amazon Linux 2), the function crashed immediately because the Linux loader could not read the macOS binary:

Error: /opt/nodejs/node_modules/sharp/build/Release/sharp-linux-x64.node: invalid ELF header

The Lambda container encountered a macOS binary on a Linux system. To fix this platform mismatch, I had to make npm target the production environment explicitly at install time:

npm install sharp --platform=linux --arch=x64 --libc=glibc
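The "invalid ELF header" message is easier to read once you know that a loader identifies a binary by its first few bytes. The sketch below is illustrative only and not part of Kiray: detectBinaryFormat is a hypothetical helper, and the bytes are passed in directly rather than read from the .node file.

```typescript
// Classify a native binary by its first four bytes (its "magic number").
type BinaryFormat = "elf" | "mach-o" | "unknown";

function detectBinaryFormat(header: Uint8Array): BinaryFormat {
  if (header.length < 4) return "unknown";
  const b0 = header[0];
  const b1 = header[1];
  const b2 = header[2];
  const b3 = header[3];
  // ELF (Linux): 0x7F followed by ASCII "ELF".
  if (b0 === 0x7f && b1 === 0x45 && b2 === 0x4c && b3 === 0x46) return "elf";
  // Thin Mach-O (macOS), little-endian on disk: 0xFEEDFACE (32-bit) or 0xFEEDFACF (64-bit).
  if ((b0 === 0xce || b0 === 0xcf) && b1 === 0xfa && b2 === 0xed && b3 === 0xfe) return "mach-o";
  return "unknown";
}

// A 64-bit macOS build starts with CF FA ED FE, so a Linux loader rejects it.
console.log(detectBinaryFormat(new Uint8Array([0xcf, 0xfa, 0xed, 0xfe]))); // mach-o
console.log(detectBinaryFormat(new Uint8Array([0x7f, 0x45, 0x4c, 0x46]))); // elf
```

A check like this could run in CI against the packaged binary before upload, so a wrong-platform build fails the pipeline instead of the function.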

Once I had the correct Linux binary, I had to decide how to deploy it. My first instinct was to zip up the node_modules folder with each individual Lambda function. The math quickly showed why this is a poor approach:

  • Sharp's Linux binary bundle size: ~8.4MB
  • Independent functions needing Sharp: 2 (generate-thumbnail and generate-medium)
  • Total impact: 16.8MB of duplicate files traveling through the deployment pipeline.

This duplication slows down deployment speeds and inflates function package sizes unnecessarily. To solve this, I decoupled the heavy dependency using an AWS Lambda Layer.

Layer Strategy:
└── kiray-sharp-layer (8.4MB Payload — Deployed Once)
    └── nodejs/node_modules/sharp/ (Linux x64 Binary)

├── generate-thumbnail/index.js (< 10KB Zip — References Layer)
└── generate-medium/index.js (< 10KB Zip — References Layer)

By uploading Sharp once to a shared layer, the individual zip files for my functions dropped from megabytes down to under 10KB.
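To make the size math concrete, here is a rough sketch of the total upload payload with and without a shared layer. The constants come from the figures above; treating 1MB as 1000KB and using 10KB as the handler size are simplifying assumptions, and the function names are hypothetical.

```typescript
// Rough deployment payload with and without a shared layer (sizes in KB).
const SHARP_BUNDLE_KB = 8400; // ~8.4MB, assuming 1MB = 1000KB
const HANDLER_ZIP_KB = 10;    // upper bound for each handler zip

function payloadWithoutLayer(functionCount: number): number {
  // Every function zip carries its own copy of sharp.
  return functionCount * (SHARP_BUNDLE_KB + HANDLER_ZIP_KB);
}

function payloadWithLayer(functionCount: number): number {
  // sharp is uploaded once; each function zip only holds the handler.
  return SHARP_BUNDLE_KB + functionCount * HANDLER_ZIP_KB;
}

console.log(payloadWithoutLayer(2)); // 16820
console.log(payloadWithLayer(2));    // 8420
```

The gap widens with every additional function that needs sharp, because only the handler term grows once the layer exists.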

While the rest of Kiray is strict TypeScript, I consciously chose to write these specific Lambda handlers in pure JavaScript for three reasons:

1. Build Step Overhead: AWS Lambda does not run TypeScript natively. Adding a compilation toolchain (like tsc or a bundler) for isolated utility functions that are only 50 lines long adds unnecessary complexity to the CI/CD pipeline.

2. Simpler Debugging: Troubleshooting native C++ binaries in the cloud is already complicated. Adding an abstract code compilation layer on top makes it harder to isolate low-level bugs.

3. Diminishing Returns: These image utility handlers are small, single-purpose, and functionally stable. The benefit of static type checking here is low compared to the maintenance effort of managing the extra build tools.

What this reinforced: Local success doesn't guarantee cloud success. When dealing with native modules, what works flawlessly on your laptop can easily fail in production if the underlying operating systems don't match. Knowing when to choose simplicity over a strict architectural principle is part of building real, high-velocity systems.

Browser DevTools Network tab showing a POST request to localhost:8000 from the production frontend
Every API call from the deployed app was going to localhost:8000. ERR_CONNECTION_REFUSED on every request.

Registration failed on the deployed app. The backend was running. The ALB was routing correctly. But every POST to /api/auth/register came back with ERR_CONNECTION_REFUSED.

Opening DevTools showed the problem immediately. The request URL was http://localhost:8000/api/auth/register. Not the ALB. Not a relative path. Literally localhost — from a browser tab pointed at an AWS load balancer.

The frontend code looked fine:

baseURL: import.meta.env.VITE_API_URL || "http://localhost:8000/api"

The || "http://localhost:8000/api" fallback was the problem. Vite reads environment variables at build time and bakes them into the JavaScript bundle. If VITE_API_URL isn't set when npm run build runs, the fallback becomes permanent — compiled into the bundle, shipped to every browser.

CI was building the Docker image without passing the variable. The Dockerfile had no ARG for it. So every production build was silently shipping the localhost fallback.

Environment variables in Vite aren't runtime config. They're compile-time constants. Whatever value exists when the build runs is the value every user gets.
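A toy model makes the failure mode easier to see. This is not how Vite is implemented; it only mimics the effect of evaluating the expression once at build time. bakeBaseUrl and requireBuildVar are hypothetical names, and the second function sketches a stricter alternative that refuses to build rather than falling back.

```typescript
// A toy model of build-time substitution: the expression is evaluated once
// during the build, and the result is what every browser receives.
type BuildEnv = Record<string, string | undefined>;

const FALLBACK_URL = "http://localhost:8000/api";

// What the original expression effectively produces at build time.
function bakeBaseUrl(buildEnv: BuildEnv): string {
  return buildEnv["VITE_API_URL"] || FALLBACK_URL;
}

// Stricter alternative: fail the build instead of silently falling back.
function requireBuildVar(buildEnv: BuildEnv, name: string): string {
  const value = buildEnv[name];
  if (!value) {
    throw new Error(`${name} must be set at build time`);
  }
  return value;
}

console.log(bakeBaseUrl({}));                       // http://localhost:8000/api
console.log(bakeBaseUrl({ VITE_API_URL: "/api" })); // /api
```

With the stricter variant, a CI build that forgets the variable stops with an error instead of shipping the localhost URL.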

Two fixes together. First, the Dockerfile:

ARG VITE_API_URL=/api
ENV VITE_API_URL=$VITE_API_URL

Second, the CI build step:

docker build --build-arg VITE_API_URL=/api -t $IMAGE_TAG .

The relative /api URL resolves against whatever origin the browser is pointed at. Behind the ALB that becomes the correct backend endpoint automatically — no hardcoded hostnames, works across environments.
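The resolution rule can be checked with the standard URL constructor. In this small sketch both page URLs are placeholders, not the project's real hostnames, and resolveApiUrl is a hypothetical helper.

```typescript
// Resolve the same relative API path against two different page origins.
// Both page URLs below are placeholders for illustration.
function resolveApiUrl(path: string, pageUrl: string): string {
  return new URL(path, pageUrl).href;
}

console.log(resolveApiUrl("/api/auth/register", "https://app.example.com/signup"));
// https://app.example.com/api/auth/register
console.log(resolveApiUrl("/api/auth/register", "http://localhost:5173/signup"));
// http://localhost:5173/api/auth/register
```

The same bundle therefore works in each environment, as long as whatever serves the page also routes /api to the backend.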

What this reinforced: Vite's import.meta.env variables are not runtime environment variables. They're replaced at build time by the bundler. If your CI pipeline builds Docker images without passing the right build args, the fallback values ship to production silently. Check the Network tab before assuming the backend is the problem.

Grafana dashboard showing all four panels with Datasource not found error
All four panels. Same error. Prometheus was running fine. The data was there. Grafana just couldn't find it.

Prometheus was scraping the backend. The metrics were flowing. I could query them directly in Grafana's Explore mode and see real data — request rates, memory, CPU. But every panel on the provisioned dashboard showed "Datasource not found."

The first instinct was to check the Prometheus container. It was running. Scrape targets were up. The same PromQL queries worked fine in Explore. The pipeline was working.

The problem wasn't the data. The problem was how the dashboard was looking for the data source.

When you provision a Grafana datasource without specifying a uid, Grafana auto-generates one at startup — something like PBFA97CFB59. The dashboard JSON, however, had a hardcoded reference: "uid": "prometheus". Those two strings didn't match. Every panel tried to look up a datasource by the uid prometheus, found nothing, and gave up.

Explore worked because it falls back to the default datasource. Provisioned dashboards don't have that fallback — they resolve by uid exactly.

One line added to the datasource provisioning file:

uid: prometheus

That's it. The dashboard's hardcoded reference now matched. All four panels populated within seconds.
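A mismatch like this can be caught before deploy with a small consistency check. The sketch below assumes dashboard panels carry a datasource object with a uid field, as in the dashboard JSON described above. The interfaces are trimmed to what the check needs, and the panel titles and function name are hypothetical.

```typescript
// Minimal shapes for the parts of a dashboard the check cares about.
interface DatasourceRef { type?: string; uid?: string; }
interface Panel { title: string; datasource?: DatasourceRef; }
interface Dashboard { panels: Panel[]; }

// Return the titles of panels whose datasource uid is not provisioned.
function findUnresolvedPanels(dashboard: Dashboard, provisionedUids: string[]): string[] {
  const known = new Set(provisionedUids);
  return dashboard.panels
    .filter((panel) => {
      const uid = panel.datasource?.uid;
      return uid !== undefined && !known.has(uid);
    })
    .map((panel) => panel.title);
}

const dashboard: Dashboard = {
  panels: [
    { title: "Request rate", datasource: { type: "prometheus", uid: "prometheus" } },
    { title: "Memory", datasource: { type: "prometheus", uid: "prometheus" } },
  ],
};

// Auto-generated uid: every panel is unresolved.
console.log(findUnresolvedPanels(dashboard, ["PBFA97CFB59"]));
// Explicit uid in the provisioning file: nothing is unresolved.
console.log(findUnresolvedPanels(dashboard, ["prometheus"]));
```

Run against the provisioning files in CI, a check like this turns a silent runtime error into a failed build.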

What this reinforced: provisioning-as-code in Grafana has a coupling between datasources and dashboards that the docs don't make obvious. They show you how to provision each side independently. They don't show you that the uid on both sides has to match. Check the uid before assuming the data isn't there.

GitHub Actions DAG showing deploy-backend correctly waiting for deploy-prometheus and deploy-grafana after the fix.

I was wiring up a new Grafana sidecar in the Demena CI pipeline when I noticed something off. Three deploy jobs — deploy-backend, deploy-prometheus, deploy-grafana — all declared needs: test. None of them said they needed each other. The moment tests passed, GitHub Actions would start all three on separate runners, in parallel.

That parallelism is the default for independent jobs. The problem was these jobs were not independent.

deploy-backend does two things in sequence: build/push the backend image to ECR, then update the ECS task definition and trigger a rollout. The task definition references three images — backend, prometheus, grafana. ECS pulls all three when it rolls a new task.

If deploy-backend finished before deploy-prometheus or deploy-grafana had pushed their :latest tags, ECS would pull the previous :latest and silently roll a task running stale sidecar config.

The deploy would succeed. Tests would be green. Logs would look fine. And Prometheus would be running scrape rules from yesterday.

The fix was one line on the backend job:

needs: [test, deploy-prometheus, deploy-grafana]

Trades 60–90 seconds of pipeline time for deploy correctness. Worth it.
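Because the workflow file has no knowledge of what the task definition references, the relationship has to be checked from outside. Here is a minimal sketch, assuming the workflow is reduced to a map of job name to its needs list and that you supply the list of jobs that push images yourself. Workflow, waitsFor, and findRaces are hypothetical names.

```typescript
// Each job lists the jobs it waits for (its "needs").
type Workflow = Record<string, string[]>;

// True if `job` waits for `dependency`, directly or transitively.
function waitsFor(workflow: Workflow, job: string, dependency: string): boolean {
  const seen = new Set<string>();
  const stack = [...(workflow[job] ?? [])];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === dependency) return true;
    if (seen.has(current)) continue;
    seen.add(current);
    stack.push(...(workflow[current] ?? []));
  }
  return false;
}

// Image-pushing jobs that the deploy job does not wait for.
function findRaces(workflow: Workflow, deployJob: string, imageJobs: string[]): string[] {
  return imageJobs.filter((job) => job !== deployJob && !waitsFor(workflow, deployJob, job));
}

const before: Workflow = {
  test: [],
  "deploy-backend": ["test"],
  "deploy-prometheus": ["test"],
  "deploy-grafana": ["test"],
};
const after: Workflow = {
  ...before,
  "deploy-backend": ["test", "deploy-prometheus", "deploy-grafana"],
};
const sidecars = ["deploy-prometheus", "deploy-grafana"];

console.log(findRaces(before, "deploy-backend", sidecars)); // both sidecar jobs are flagged
console.log(findRaces(after, "deploy-backend", sidecars));  // empty: no races left
```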

What this reinforced: parallelism in CI is correct by default for independent jobs, but ECS task definitions create implicit dependencies between deploy jobs that GitHub Actions cannot see. The DAG of "what depends on what" lives in your task definition, not your workflow file. If you only model dependencies that GitHub knows about, you'll race yourself.

Screenshot of GitHub Actions showing green checkmarks for test-backend and build-frontend with a successful ECS deployment
The final state: All 25 tests passing in the Ubuntu runner before handing off to the ECS rolling deployment.

Everything was green on my MacBook. The moment I pushed the code, the pipeline collapsed. The error: ECONNREFUSED 127.0.0.1:3306. The database was refusing connections that should have worked.

Six hours of debugging revealed three separate problems that looked like one.

First: localhost wasn't resolving to 127.0.0.1 in the GitHub Actions runner. It was resolving to ::1 (IPv6). MySQL wasn't listening there.

Second: dotenv.config() was loading zero variables from a missing .env file without raising an error, and the values the app ended up with were not the ones I had set in the GitHub Actions YAML. Zero noise. Just wrong values.

The actual problem (and the hardest to see): MySQL health checks were passing, but the service wasn't actually ready to accept connections yet. The runner said "up." MySQL said otherwise.

Healthy container != Ready service. Docker handles the process, but the database engine handles the socket. Don't trust the status light; trust the ping.

The fix that unlocked everything was a bash wait script: 30 attempts, 2-second intervals, running mysqladmin ping until the server actually responded.

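# Wait for MySQL to accept connections: up to 30 attempts, 2 seconds apart
for i in {1..30}; do
  if mysqladmin ping -h"127.0.0.1" --silent; then
    echo "MySQL is up!"
    exit 0
  fi
  sleep 2
done

# Every attempt failed: fail the step instead of continuing without a database
echo "MySQL did not become ready in time" >&2
exit 1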
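The same poll-until-ready idea can live in application or test setup code. This is a minimal sketch, not the script used in the pipeline: waitUntilReady is a hypothetical helper, and isReady stands in for whatever real probe you use (for example, attempting a database connection). The sleep function is injectable so the loop can be exercised without real delays.

```typescript
// Poll a readiness check until it succeeds or the attempts run out.
async function waitUntilReady(
  isReady: () => Promise<boolean>,
  attempts = 30,
  delayMs = 2000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let i = 1; i <= attempts; i++) {
    // A rejected check (e.g. connection refused) counts as "not ready yet".
    const ok = await isReady().catch(() => false);
    if (ok) return true;
    if (i < attempts) await sleep(delayMs);
  }
  return false;
}

// Demo with a fake check that only succeeds on its third call.
let calls = 0;
const fakeCheck = async () => ++calls >= 3;
waitUntilReady(fakeCheck, 5, 0).then((ready) => console.log(ready, calls)); // true 3
```

Returning false rather than throwing leaves the caller to decide whether an unready service should abort the run.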

What this reinforced: Networking in CI isn't the same as your local machine. You have to verify the full chain: IPv4 precedence, environment priority, and actual service readiness. The pipeline has been solid since.

DNS chain diagram showing registrar, nameserver, DNS records and domain resolution with AWS and Vercel icons
The broken chain (Route 53 in control) vs the fixed chain (GoDaddy + Vercel records)

Vercel build was green. The Vercel URL loaded fine. I typed amtenu.ca and got nothing — site not found, no error, no indication of what was wrong.

The actual problem: GoDaddy wasn't the authoritative nameserver. AWS Route 53 was — left over from a previous infrastructure setup. Every record I edited in GoDaddy was being ignored. Route 53 was answering all queries for amtenu.ca, and Route 53 had no Vercel records.

Registrar → points to nameserver → nameserver holds the actual records. They can be completely different providers. Check who is answering before you edit anything.
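One way to "check who is answering" is to list the domain's NS records (for example with dig NS amtenu.ca +short) and look at the hostnames. The sketch below classifies those hostnames by commonly seen provider patterns. The patterns are heuristics, the sample nameservers are made up for illustration, and classifyNameserver and whoAnswers are hypothetical names.

```typescript
// Guess which DNS provider is authoritative from a domain's NS records.
// The hostname patterns are heuristics, not an official list.
type DnsProvider = "route53" | "godaddy" | "vercel" | "unknown";

function classifyNameserver(ns: string): DnsProvider {
  const host = ns.toLowerCase().replace(/\.$/, ""); // drop the trailing root dot
  if (/^ns-\d+\.awsdns-\d+\./.test(host)) return "route53";
  if (host.endsWith(".domaincontrol.com")) return "godaddy";
  if (host.endsWith(".vercel-dns.com")) return "vercel";
  return "unknown";
}

// Distinct providers across all NS records, in first-seen order.
function whoAnswers(nsRecords: string[]): DnsProvider[] {
  return nsRecords
    .map(classifyNameserver)
    .filter((provider, index, all) => all.indexOf(provider) === index);
}

// Made-up sample records for illustration.
console.log(whoAnswers(["ns-123.awsdns-45.com.", "ns-678.awsdns-90.net."]));     // route53
console.log(whoAnswers(["ns01.domaincontrol.com.", "ns02.domaincontrol.com."])); // godaddy
```

If the provider that comes back is not the one whose dashboard you are editing, your records are not the ones being served.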

Fix was straightforward once I understood the chain — switch nameservers back to GoDaddy, delete the conflicting records, add the correct Vercel A record and CNAME. Then wait.

Propagation took 20 hours. Used dnschecker.org to watch it go green region by region. amtenu.ca is live.

What this reinforced: DNS debugging means verifying the full chain, not just checking that records look right in one provider's UI. The system doesn't tell you when your edits are being ignored.
