Mini Project: Building an ASCII Art Engine from Scratch — Edges, Color, and a Live Webcam Feed
ASCII art is one of those ideas that sounds trivial until you actually try to make it look good. Mapping a photo to a grid of characters is a one-liner. Making it look like the photo — sharp outlines where there are edges, tonal variation where there is shade, correct proportions, real color — takes a lot more thought. This post walks through the design of image-to-ascii, a Python pipeline I built that converts images, videos, GIFs, and live webcam footage into truecolor ASCII art.
Inspiration
This project started after watching Acerola’s video I Tried Turning Games Into ASCII. He implements ASCII art as a post-process shader — running directly on the GPU, converting a rendered game frame in real time using luminance mapping, Sobel gradients for edge direction, and Difference-of-Gaussians for edge strength. The shader approach is elegant: it fits naturally into a rendering pipeline and runs fast enough to be invisible.
I wanted to understand the same ideas but outside a shader context — working on arbitrary images and video in Python, where I could inspect every intermediate array, tune every parameter, and push the output all the way to a live terminal stream. The core techniques are the same (luminance ramp, DoG + Sobel edge detection), but the implementation and the constraints are different enough that building it from scratch taught me a lot.
Why bother?
Part of the answer is that it is a genuinely fun visual problem. The constraints are strange: you have a fixed tile size, a character set with no continuous range, and glyphs that are much taller than they are wide. Working within those constraints and still producing something that looks intentional is a good engineering puzzle.
The deeper answer is that it touches a lot of real signal-processing territory — luminance mapping, edge detection, gradient analysis, coherence filtering — in a context where the output is immediately visible and interpretable. You can see your algorithm working or failing in real time, which makes iteration fast and feedback unambiguous.
The pipeline
Every frame, whether from a still image, a video, or a webcam, passes through the same five-stage pipeline:
RGB image
│
▼
1. Downsample to cell grid (block-mean average per cell)
│
▼
2. Luminance pass (brightness → character ramp)
│
▼
3. Edge strength map (Canny + Difference-of-Gaussians)
│
▼
4. Angle estimation + cell voting (Sobel gradients → | / - \)
│
▼
5. Render (text file, PNG, ANSI, or video frame)
The key design choice is that the luminance pass and the edge pass are separate and then merged. Luminance gives you fill — the tonal body of the image. Edges give you structure — the outlines and lines that define shape. You could run either alone, but together they produce output that reads like a real image.
Luminance: the easy part
Each cell in the output grid covers a rectangular block of source pixels. The mean brightness of that block is mapped to a character from a ramp, dark to light:
" .:coPO?@█" ← default (10 levels)
Wider ramps capture more tonal nuance at the cost of a busier look. The bourke ramp uses 70 characters and works well for dense photographs; the stippling ramp drops to 4 levels and produces a minimal dot-matrix aesthetic.
The one non-obvious detail here is cell aspect. Monospace characters are roughly twice as tall as they are wide, so if you sample square pixel blocks you get output that appears vertically stretched. The pipeline compensates: cell_h = cell_w × cell_aspect, so each character cell samples a taller-than-square block of pixels. For most terminals 2.0 is correct; Ghostty’s default font needs closer to 2.5.
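To make the shape of this stage concrete, here is a minimal sketch of block-mean downsampling plus ramp lookup in NumPy. It is illustrative rather than the project's actual ascii.py, and the Rec. 601 luma weights are an assumption.

```python
import numpy as np

RAMP = " .:coPO?@█"  # dark → light, 10 levels

def luminance_pass(rgb, cols=160, cell_aspect=2.0, ramp=RAMP):
    """Downsample to the cell grid by block-mean, then map brightness to a ramp index."""
    h, w, _ = rgb.shape
    cell_w = max(1, w // cols)
    cell_h = max(1, int(round(cell_w * cell_aspect)))   # taller-than-square cells
    rows = h // cell_h

    # Crop to a whole number of cells, then average each block.
    crop = rgb[:rows * cell_h, :cols * cell_w].astype(np.float32)
    blocks = crop.reshape(rows, cell_h, cols, cell_w, 3).mean(axis=(1, 3))  # (rows, cols, 3)

    # Rec. 601 luma, normalised to [0, 1], then quantised onto the ramp.
    luma = blocks @ np.array([0.299, 0.587, 0.114]) / 255.0
    idx = np.clip((luma * len(ramp)).astype(int), 0, len(ramp) - 1)
    chars = np.array(list(ramp))[idx]
    return chars, blocks.astype(np.uint8)   # character grid + per-cell mean color
```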
Edges: the hard part
This is where the interesting work happens. The goal is to replace a luminance character with a directional line character (|, /, -, \) wherever an edge runs through that cell, oriented to match the actual angle of the edge.
Two complementary detectors
Canny is the standard choice for edge detection. It finds sharp, high-contrast boundaries well. But it misses softer edges — gradual transitions, the curves of a face, the outline of a cloud against a pale sky. Those edges exist; they just do not cross the gradient magnitude threshold that Canny needs.
Difference-of-Gaussians (DoG) is a complementary approach. It computes |Gaussian(σ1) − Gaussian(σ2)|, which highlights regions where the image changes across a frequency band set by the ratio of the two sigmas. The default ratio of 1.0 : 1.6 approximates the human visual system’s edge response. DoG catches edges that Canny misses and vice versa.
The combined strength map is:
strength = clip(canny + dog_weight × dog, 0, 1)
dog_weight defaults to 0.5, giving Canny primary authority but letting DoG fill gaps.
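A rough sketch of the combined detector with OpenCV, assuming a grayscale float image in [0, 1]; the parameter defaults mirror the values described above, everything else is illustrative:

```python
import cv2
import numpy as np

def edge_strength(gray, canny_low=50, canny_high=150,
                  sigma=1.0, sigma_ratio=1.6, dog_weight=0.5):
    """Combine Canny (hard edges) with Difference-of-Gaussians (soft edges)."""
    gray_u8 = (gray * 255).astype(np.uint8)

    # Canny gives a binary map of sharp, high-contrast boundaries.
    canny = cv2.Canny(gray_u8, canny_low, canny_high).astype(np.float32) / 255.0

    # DoG highlights changes in the frequency band between the two sigmas.
    g1 = cv2.GaussianBlur(gray, (0, 0), sigma)
    g2 = cv2.GaussianBlur(gray, (0, 0), sigma * sigma_ratio)
    dog = np.abs(g1 - g2)
    dog /= dog.max() + 1e-8          # normalise to [0, 1]

    return np.clip(canny + dog_weight * dog, 0.0, 1.0)
```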
Angle estimation
Edge strength tells you where edges are. To pick the right line character you also need to know which way they run. Sobel gradients give you gx and gy per pixel, from which you get an angle. But raw per-pixel angles are noisy.
The solution is magnitude-weighted bin voting per cell. Each pixel votes for one of four angle bins (0°, 45°, 90°, 135°) with a vote weight of gradient_magnitude × edge_strength. The bin with the most vote weight wins, and its corresponding character is drawn.
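A sketch of the per-pixel step (illustrative, not the project's edges.py):

```python
import cv2
import numpy as np

def quantise_angles(gray, strength):
    """Sobel gradients → edge orientation in [0°, 180°) → one of four direction bins."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)

    magnitude = np.hypot(gx, gy)
    grad_angle = np.degrees(np.arctan2(gy, gx))
    edge_angle = (grad_angle + 90.0) % 180.0        # the edge runs perpendicular to the gradient

    # Nearest of 0°, 45°, 90°, 135° → characters -  /  |  \
    bins = np.round(edge_angle / 45.0).astype(int) % 4
    weights = magnitude * strength                   # vote weight per pixel
    return bins, weights
```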
The coherence gate
This is the detail that makes the biggest difference in practice.
Naive voting gives every edge cell a direction, even cells where the edge is a smudge or a texture rather than a clean line. Assigning a line character to those cells produces visual noise — random | and / in areas where no line should appear.
The fix is a coherence check. After voting, compute:
coherence = max_bin_weight / total_weight
If the dominant direction captures less than a threshold fraction of all the vote weight (default 40%), the cell is ambiguous — it might contain a cross-hatch, a texture, or overlapping edges going in multiple directions. Those cells are left as luminance characters. Only cells where one direction clearly dominates become line characters.
The result is that line characters appear on clean, unambiguous edges and nowhere else.
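And a sketch of the per-cell decision, reusing the bins and weights from the previous snippet. Threshold defaults follow the values quoted above; the exact normalisation of the edge-energy check is an assumption.

```python
import numpy as np

LINE_CHARS = "-/|\\"   # bins 0..3 → 0°, 45°, 90°, 135°

def cell_character(bins, weights, strength, lum_char,
                   edge_threshold=0.05, coherence_threshold=0.4):
    """Decide one cell: a directional line character, or fall back to luminance."""
    if strength.mean() < edge_threshold:      # not enough edge energy in this cell
        return lum_char

    # Magnitude-weighted vote per direction bin within the cell.
    bin_weight = np.bincount(bins.ravel(), weights=weights.ravel(), minlength=4)
    total = bin_weight.sum()
    if total == 0:
        return lum_char

    coherence = bin_weight.max() / total      # fraction captured by the dominant bin
    if coherence < coherence_threshold:       # ambiguous: texture or crossing edges
        return lum_char

    return LINE_CHARS[int(bin_weight.argmax())]
```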
Rendering: three output modes
Once the character grid and per-cell colors are computed, rendering is straightforward. There are three output paths:
Plain text — a .txt file, one row per line, useful for piping or archiving.
PNG — drawn in a real monospace font (Menlo on macOS, DejaVu Sans Mono on Linux) using Pillow. Each character is colored with the mean RGB of its source cell. The result looks like a photograph reconstructed from glyphs.
Truecolor ANSI — a .ans file that encodes each cell as a 24-bit ANSI escape sequence:
\x1b[38;2;R;G;Bm<character>
Consecutive cells with the same color share an escape, so the file stays compact. View it with cat file.ans or less -R file.ans in any truecolor terminal — iTerm2, Ghostty, Kitty, Alacritty, WezTerm.
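The escape-sharing idea in miniature (illustrative, assuming chars is a 2-D character grid and colors holds one (r, g, b) triple per cell):

```python
def to_ansi(chars, colors):
    """Render a character grid plus per-cell RGB colors as truecolor ANSI text.
    Consecutive cells with the same color reuse the previous escape sequence."""
    lines = []
    for row_chars, row_colors in zip(chars, colors):
        out, last = [], None
        for ch, (r, g, b) in zip(row_chars, row_colors):
            if (r, g, b) != last:
                out.append(f"\x1b[38;2;{r};{g};{b}m")
                last = (r, g, b)
            out.append(ch)
        out.append("\x1b[0m")        # reset at end of each line
        lines.append("".join(out))
    return "\n".join(lines)
```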
Video and GIF support
The pipeline was designed around a process_frame(rgb, cfg) function that is stateless and pure — it takes a pixel array, returns a character grid and a color grid. That design made video support almost free.
For video files, OpenCV’s VideoCapture feeds frames one at a time. Each frame goes through the pipeline and the rendered PIL image is written to a VideoWriter. For GIFs, Pillow collects all rendered frames and saves them as an animated GIF with correct frame timing.
Extension detection is automatic: .mp4, .mov, .avi, .webm, .mkv, and .gif all route to the video path.
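Roughly what that loop looks like, assuming process_frame and a render_frame helper that returns a PIL image; the names and the mp4v codec choice are illustrative:

```python
import cv2
import numpy as np

def run_video(path, out_path, cfg, process_frame, render_frame):
    """Feed frames through the stateless pipeline and write them back out."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    writer = None

    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
        chars, colors = process_frame(rgb, cfg)          # pure, stateless core
        frame = np.array(render_frame(chars, colors))    # PIL image → numpy RGB

        if writer is None:                               # output size known after first frame
            h, w = frame.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
        writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

    cap.release()
    if writer:
        writer.release()
```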
Live webcam mode
The most visually striking feature is the live webcam stream. It reads frames from the camera at 30+ fps and renders them directly to the terminal using ANSI escape codes and an alternate screen buffer (so it never pollutes your scrollback).
The terminal interface is tunable while running. Hotkeys let you adjust width, cell aspect ratio, Canny thresholds, edge threshold, coherence threshold, color mode, character ramp, and mirror. Pressing p snapshots the current frame to a PNG. There is no config file to edit and no restart required — every parameter is live.
| Key | Action |
|---|---|
| q / ESC | quit |
| e | toggle edges-only |
| + / - | width ±10 |
| a / A | cell aspect ∓0.1 |
| [ / ] | canny-low ∓5 |
| { / } | canny-high ∓5 |
| , / . | edge-threshold ∓0.05 |
| < / > | coherence-threshold ∓0.05 |
| t | cycle color mode |
| r | cycle ramp |
| m | mirror |
| p | snapshot |
This is genuinely useful for tuning: you can watch edge detection respond to changes in real time, find the right coherence threshold for your terminal font, and dial in the look you want without running the pipeline repeatedly on a test image.
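Stripped of hotkey handling, the core of that loop is small. This is a sketch under the same assumptions as before (a process_frame pipeline and a to_ansi renderer), not the actual live.py:

```python
import sys
import cv2

def live_loop(cfg, process_frame, to_ansi):
    """Read webcam frames, render ANSI, and draw into the alternate screen buffer."""
    cap = cv2.VideoCapture(0)
    sys.stdout.write("\x1b[?1049h\x1b[?25l")   # enter alternate screen, hide cursor
    try:
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            chars, colors = process_frame(rgb, cfg)
            sys.stdout.write("\x1b[H" + to_ansi(chars, colors))  # home cursor, redraw
            sys.stdout.flush()
    finally:
        sys.stdout.write("\x1b[?25h\x1b[?1049l")  # restore cursor and main screen
        sys.stdout.flush()
        cap.release()
```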
Architecture
The codebase is split into focused modules with no circular dependencies:
| Module | Responsibility |
|---|---|
| ascii.py | Luminance ramp definitions and block-mean downsampling |
| edges.py | Canny, DoG, Sobel, angle quantisation, cell voting, coherence gate |
| render.py | Text, ANSI, and PIL/PNG output |
| colors.py | Named color parsing (white, amber, #ff8800, 255,136,0) |
| presets.py | Named CLI flag bundles (e.g. --preset amber-pop) |
| pipeline.py | process_frame + run (image) + run_video (video/GIF) |
| cli.py | argparse entry point for image/video/GIF |
| live.py | Webcam loop, terminal rendering, hotkey handling |
process_frame is the core. Everything else is either preparing data for it or consuming its output.
What I would do differently
Parallel frame rendering. Video processing is currently sequential. Each frame blocks on the pipeline before the next is read. For long videos, a producer-consumer queue with worker threads rendering frames in parallel would be a straightforward win.
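One way to sketch that, using a bounded queue of futures so long videos never pile up in memory; the callable names are placeholders, not existing code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_video_parallel(read_frames, process, write, workers=4, depth=16):
    """Bounded producer-consumer: decode, render in parallel, write back in frame order."""
    pending = Queue(maxsize=depth)            # back-pressure on the decoder

    def producer(pool):
        for rgb in read_frames():
            pending.put(pool.submit(process, rgb))
        pending.put(None)                     # sentinel: no more frames

    with ThreadPoolExecutor(max_workers=workers) as pool:
        threading.Thread(target=producer, args=(pool,), daemon=True).start()
        while (fut := pending.get()) is not None:
            write(fut.result())               # futures were queued in frame order
```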
Terminal font metrics. The default cell_aspect of 2.0 is correct for many terminals but not all, and the webcam mode ships a different default (2.5) that matches Ghostty better. Ideally the live mode would query the terminal for actual cell dimensions using the \x1b[16t sequence and compute the correct aspect automatically.
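A hedged sketch of that query on POSIX terminals; not every terminal answers, so it needs a timeout and a fallback:

```python
import os
import re
import select
import sys
import termios
import tty

def query_cell_aspect(timeout=0.2):
    """Ask the terminal for its character cell size in pixels (CSI 16 t).
    Returns height/width as a float, or None if the terminal never replies."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)
        sys.stdout.write("\x1b[16t")
        sys.stdout.flush()
        reply = ""
        while not reply.endswith("t"):
            ready, _, _ = select.select([fd], [], [], timeout)
            if not ready:
                return None                   # terminal ignored the query
            reply += os.read(fd, 1).decode()
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)
    match = re.search(r"\[6;(\d+);(\d+)t", reply)  # reply: ESC [ 6 ; height ; width t
    return int(match.group(1)) / int(match.group(2)) if match else None
```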
Streaming ANSI for video. The current video path renders to mp4 or gif. Streaming ANSI directly to the terminal (like the webcam mode does) for a local video file would be a low-overhead alternative to writing a full video file.
Stack
- Python 3.14
- NumPy — all array operations
- OpenCV — Canny, Gaussian blur, Sobel, VideoCapture, VideoWriter
- Pillow — PNG and GIF output
- uv — dependency management and virtual environment
No ML, no neural networks. Everything here is classical signal processing.
Try it
git clone <repo>
uv sync
# Still image
uv run python -m src.cli photo.jpg -o output/out --width 160
# Live webcam
uv run python -m src.live
# Video
uv run python -m src.cli clip.mp4 -o output/clip
# Edges only
uv run python -m src.cli photo.jpg -o output/edges --edges-only --edge-threshold 0.05