Some interesting examples of Monte Carlo simulations performed with CUDA Python/CuPy in Google Colab. The notebooks are authored by Onri Jay Benally, with citations included where relevant.
No need to download anything manually. Just run the notebooks.
- Josephson Junction Quantum Tunneling Prediction
- 3D Ion Beam Etching Simulation
- 2D Heat Equation
- Egg White Resist Electron-Beam Penetration Simulation
- Terabyte-Level L1 Cache Prediction
- Semantic Shift Simulation
The tables below summarize practical differences when using CuPy, CUDA Python (Numba / NVIDIA cuda-python), or Julia CUDA on Colab's three main GPU options (T4, L4, A100).

General comparison (the free-tier T4, 16 GB, is the baseline):

| Aspect | CuPy | CUDA Python (Numba / cuda-python) | Julia CUDA |
|---|---|---|---|
| Pre-installed? | Yes; `cupy-cuda12x` ≥ 13.3 is already in the Colab base image | Yes; Numba is present, but its PTX version may lag the driver | No; the Julia kernel is optional; add with `Pkg.add("CUDA")` |
| One-liner setup | Usually none; `pip install -q --upgrade cupy-cuda12x` only if an upgrade is needed | `pip install -q --upgrade "numba[cuda]"`, plus env vars if NVVM is not found | `using Pkg; Pkg.add("CUDA")` |
| Kernel authoring style | NumPy-like array ops; optional `RawKernel` / `cupyx.jit` for custom GPU code | Full control: `@cuda.jit` on Python, or embed PTX / CUDA C strings | Full control: `@cuda` kernels in Julia |
| Library coverage | cuBLAS, cuFFT, cuSOLVER, cuSPARSE, NCCL, cuDNN | You invoke CUDA libs manually or via `numba.cuda` driver calls | Julia wrappers for BLAS/FFT/DNN; high-level `CuArray` API |
| Typical speed-up vs NumPy | 20–60× for vectorized math | Similar if kernels are tuned; launch overhead dominates on tiny arrays | Comparable; sometimes +5% from LLVM optimizations |
| Common pitfalls | Duplicate wheels (11x and 12x) break the loader; OOM at 16 GB | `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` after a Colab image update | First run pre-compiles packages (30–60 s) |
| Best-fit workloads | Drop-in acceleration for array algebra, FFTs, ML inference | Custom Monte Carlo, stencils, irregular memory access | Native Julia data/ML pipelines needing GPU |
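For the drop-in CuPy path in the table above, here is a minimal sketch (illustrative only, not one of the notebooks listed earlier) of a Monte Carlo estimate of π written entirely with NumPy-style array operations:

```python
import cupy as cp  # drop-in GPU replacement for NumPy arrays

def estimate_pi(n_samples: int = 10_000_000, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from the fraction of random points inside the unit circle."""
    rng = cp.random.default_rng(seed)
    x = rng.random(n_samples, dtype=cp.float32)   # samples are generated directly on the GPU
    y = rng.random(n_samples, dtype=cp.float32)
    inside = cp.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * int(inside) / n_samples          # only one scalar is copied back to the host

print(estimate_pi())  # ~3.1415 for large n_samples
```

Swapping `import cupy as cp` for `import numpy as cp` runs the identical code on the CPU, which is what makes this the lowest-friction option.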
L4 GPU (Ada Lovelace, SM 89, 24 GB):

| Aspect | CuPy | CUDA Python | Julia CUDA |
|---|---|---|---|
| Wheel / package | `cupy-cuda12x` ≥ 13.3 ships a fatbin for SM 89 | Numba ≥ 0.61 required for SM 89 PTX | CUDA.jl auto-detects the arch |
| Precision extras | FP8 tensor-core matmul via `precision="fp8"` | Needs inline PTX / CUTLASS kernels for FP8 | `allow_fp8!()` (CUDA.jl 5.1+) |
| Memory mgmt | Pool hides `cudaMalloc`; 24 GB ceiling | Manual or managed; same ceiling | Automatic through the Julia runtime |
| Perf vs T4 | ~3× on dense matmul / conv | Similar once tuned; fewer SMs can limit occupancy | Similar to CuPy |
| Limitations | BW ≈ 300 GB/s (PCIe 4.0), not HBM | Same bandwidth cap | Same |
| Colab cost (Pro/Pro+) | ≈ $0.48 hr⁻¹ (4.8 CU hr⁻¹) | Same | Same |
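To confirm which accelerator a session actually received, and how much device memory is still free before CuPy's memory pool (the "Memory mgmt" row) starts caching allocations, a few plain CuPy runtime calls are enough; a rough sketch:

```python
import cupy as cp

dev = cp.cuda.Device(0)
props = cp.cuda.runtime.getDeviceProperties(0)

print("GPU:", props["name"].decode())                  # e.g. "NVIDIA L4" on an L4 session
print("Compute capability:", dev.compute_capability)   # "89" on L4, "80" on A100, "75" on T4
free, total = dev.mem_info                              # free / total device memory in bytes
print(f"Memory: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# The default memory pool caches freed allocations to avoid repeated cudaMalloc calls;
# returning the cached blocks to the driver can help when a session is close to OOM.
cp.get_default_memory_pool().free_all_blocks()
```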
A100 GPU (Ampere, SM 80, 40 GB):

| Aspect | CuPy | CUDA Python | Julia CUDA |
|---|---|---|---|
| Wheel / package | The same `cupy-cuda12x` wheel covers SM 80 | Numba ≥ 0.57 supports SM 80 | CUDA.jl auto-detects the arch |
| Precision extras | Enable TF32: `cp.cuda.set_matmul_precision("tf32")` | `@cuda.jit(fastmath=True)` -> TF32 tensor cores | `allow_tf32!()` |
| Memory / BW | 40 GB HBM2e, 1.6 TB s⁻¹ | Same | Same |
| Perf gain vs T4 | 10–15× on GEMM / conv | Similar after occupancy tuning | Similar |
| Session cost | ≈ $1.18 hr⁻¹ (11.8 CU hr⁻¹) | Same | Same |
| Caveats | Limited availability; CUs burn fast | Kernel launch + compile time higher | Initial package compile time |
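For the "custom Monte Carlo" workloads the tables assign to CUDA Python, a minimal Numba sketch (hypothetical example, not taken from the notebooks) using Numba's per-thread xoroshiro128+ RNG helpers; the kernel name, launch configuration, and sample counts are arbitrary:

```python
import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32

@cuda.jit(fastmath=True)
def pi_kernel(rng_states, samples_per_thread, counts):
    """Each thread draws its own points and records how many fall inside the unit circle."""
    tid = cuda.grid(1)
    if tid < counts.size:
        inside = 0
        for _ in range(samples_per_thread):
            x = xoroshiro128p_uniform_float32(rng_states, tid)
            y = xoroshiro128p_uniform_float32(rng_states, tid)
            if x * x + y * y <= 1.0:
                inside += 1
        counts[tid] = inside

blocks, threads, per_thread = 64, 256, 4096
n_threads = blocks * threads
rng_states = create_xoroshiro128p_states(n_threads, seed=1)
counts = cuda.device_array(n_threads, dtype=np.int32)

pi_kernel[blocks, threads](rng_states, per_thread, counts)
print(4.0 * counts.copy_to_host().sum() / (n_threads * per_thread))  # ~3.1415
```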
- Start with CuPy for anything expressible as NumPy/SciPy—lowest friction, high speed.
- Use CUDA Python only for the hotspots that need bespoke parallel patterns; stay on the latest Numba.
- Prefer Julia CUDA if your workflow is already in Julia—performance parity with cleaner syntax.
- Choose GPU by memory and budget: Free T4 for prototyping; L4 for moderate models with FP8; A100 when you need 40 GB or TF32 accuracy.
Yes = works out of the box; No = requires explicit install / setup.