# Monte-Carlo-Sim

Some interesting examples of Monte Carlo simulations performed with CUDA Python/CuPy in Google Colab. The notebooks are authored by Onri Jay Benally, with citations where relevant.

No need to download anything manually. Just run the notebooks.

To render the notebooks in the browser, use nbviewer.
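For a flavor of the approach, here is a minimal, illustrative sketch (not taken from any of the repository notebooks) of a Monte Carlo estimate of π written with CuPy, the kind of NumPy-style GPU code the notebooks build on:

```python
# Illustrative only (not one of the repository notebooks): Monte Carlo estimate of pi with CuPy.
import cupy as cp

def estimate_pi(n_samples: int = 10_000_000, seed: int = 0) -> float:
    """Sample points in the unit square on the GPU and count hits inside the quarter circle."""
    rng = cp.random.default_rng(seed)
    x = rng.random(n_samples, dtype=cp.float32)
    y = rng.random(n_samples, dtype=cp.float32)
    hits = cp.count_nonzero(x * x + y * y <= 1.0)  # evaluated entirely on the GPU
    return 4.0 * float(hits) / n_samples           # only one scalar crosses back to the host

print(f"pi ≈ {estimate_pi():.5f}")
```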


Below are some examples pulled from the Colab notebooks:

### Josephson Junction Quantum Tunneling Prediction



### 3D Ion Beam Etching Simulation



### 2-D Heat Equation

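For orientation only, here is a minimal sketch of one common way to step a 2-D heat equation on the GPU with CuPy (explicit finite differences with periodic boundaries); the notebook itself may use a different scheme, grid, and boundary conditions.

```python
# Illustrative only (not the notebook's implementation): explicit FTCS update
# for the 2-D heat equation u_t = alpha * (u_xx + u_yy) on a square grid.
import cupy as cp

def step_heat(u, alpha=1.0, dx=1.0, dt=0.2):
    """One explicit time step; dt must satisfy dt <= dx**2 / (4 * alpha) for stability."""
    lap = (cp.roll(u, 1, axis=0) + cp.roll(u, -1, axis=0)
           + cp.roll(u, 1, axis=1) + cp.roll(u, -1, axis=1) - 4.0 * u) / dx**2
    return u + alpha * dt * lap   # periodic boundaries via cp.roll

u = cp.zeros((512, 512), dtype=cp.float32)
u[240:272, 240:272] = 100.0       # hot patch in the middle of the domain
for _ in range(500):
    u = step_heat(u)
```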


### Egg White Resist Electron-Beam Penetration Simulation



### Terabyte-Level L1 Cache Prediction



### Semantic Shift Simulation



## GPU Library Comparison Cheat-Sheet for Google Colab

The tables below summarize practical differences between CuPy, CUDA Python (Numba / NVIDIA cuda-python), and Julia CUDA on Colab's three main GPU options.

### Free GPU Tier (Tesla T4, 16 GB)

| Aspect | CuPy | CUDA Python (Numba / cuda-python) | Julia CUDA |
| --- | --- | --- | --- |
| Pre-installed? | Yes (cupy-cuda12x ≥ 13.3 already in the Colab base image) | Yes (Numba present, but its PTX version may lag the driver) | No (Julia kernel optional; add with `Pkg.add("CUDA")`) |
| One-liner setup | Usually none; upgrade only if needed: `pip install -q --upgrade cupy-cuda12x` | `pip install -q --upgrade "numba[cuda]"`, plus env vars if NVVM is not found | `using Pkg; Pkg.add("CUDA")` |
| Kernel authoring style | NumPy-like array ops; optional `RawKernel` / `cupyx.jit` for custom GPU code | Full control: `@cuda.jit` on Python, or embed PTX / CUDA C strings | Full control: `@cuda` kernels in Julia |
| Library coverage | cuBLAS, cuFFT, cuSOLVER, cuSPARSE, NCCL, cuDNN | You invoke CUDA libs manually or via `numba.cuda` driver calls | Julia wrappers for BLAS/FFT/DNN; high-level `CuArray` API |
| Typical speed-up vs NumPy | 20–60× for vectorized math | Similar if kernels are tuned; launch overhead on tiny arrays | Comparable; sometimes +5 % from LLVM optimizations |
| Common pitfalls | Duplicate wheels (11x & 12x) break the loader; OOM at 16 GB | `CUDA_ERROR_UNSUPPORTED_PTX_VERSION` after a Colab image update | First run pre-compiles packages (30–60 s) |
| Best-fit workloads | Drop-in acceleration for array algebra, FFTs, ML inference | Custom Monte Carlo, stencils, irregular memory access | Native Julia data/ML pipelines needing GPU |
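To make the "Kernel authoring style" row concrete, here is a hedged sketch (not from the notebooks) of the same element-wise operation written both ways: as a CuPy array expression and as an explicit Numba `@cuda.jit` kernel that receives the CuPy arrays through the CUDA Array Interface.

```python
# Sketch of the two kernel-authoring styles compared above (illustrative, not from the notebooks).
import math
import cupy as cp
from numba import cuda

x = cp.random.random(1_000_000, dtype=cp.float32)

# 1) CuPy style: NumPy-like array ops; CuPy chooses and launches the kernels for you.
y_cupy = cp.sqrt(x) * 2.0 + 1.0

# 2) CUDA Python style: you write the kernel and pick the launch geometry yourself.
@cuda.jit
def scale_sqrt(inp, out):
    i = cuda.grid(1)              # global thread index
    if i < inp.shape[0]:
        out[i] = math.sqrt(inp[i]) * 2.0 + 1.0

y_numba = cp.empty_like(x)
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale_sqrt[blocks, threads_per_block](x, y_numba)   # CuPy arrays pass via __cuda_array_interface__

assert bool(cp.allclose(y_cupy, y_numba))
```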

### Paid GPU: NVIDIA L4 (Ada Lovelace, CC 8.9, 24 GB)

| Aspect | CuPy | CUDA Python | Julia CUDA |
| --- | --- | --- | --- |
| Wheel / package | cupy-cuda12x ≥ 13.3 ships a fatbin for SM 89 | Numba ≥ 0.61 required for SM 89 PTX | CUDA.jl auto-detects the arch |
| Precision extras | FP8 tensor-core matmul via `precision="fp8"` | Needs inline PTX / CUTLASS kernels for FP8 | `allow_fp8!()` (CUDA.jl 5.1+) |
| Memory management | Pool hides `cudaMalloc`; 24 GB ceiling | Manual or managed; same ceiling | Automatic through the Julia runtime |
| Perf vs T4 | ~3× on dense matmul / conv | Similar once tuned; fewer SMs can limit occupancy | Similar to CuPy |
| Limitations | BW ≈ 300 GB/s (PCIe 4.0), not HBM | Same bandwidth cap | Same |
| Colab cost (Pro/Pro+) | ≈ $0.48 hr⁻¹ (4.8 CU hr⁻¹) | Same | Same |
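The "Memory management" row can be inspected directly. Below is a small, hedged sketch (not from the notebooks) of querying and trimming CuPy's default memory pool, which caches allocations so repeated kernels avoid raw `cudaMalloc` calls; that matters when you are working against the 24 GB ceiling on the L4.

```python
# Sketch: inspecting CuPy's default memory pool (applies to any of the Colab GPUs).
import cupy as cp

pool = cp.get_default_memory_pool()

a = cp.random.random((4096, 4096), dtype=cp.float32)   # ~64 MiB allocation served by the pool
b = a @ a                                               # temporary workspace also comes from the pool

print(f"used: {pool.used_bytes()  / 2**20:.1f} MiB")
print(f"held: {pool.total_bytes() / 2**20:.1f} MiB")    # includes cached, currently-unused blocks

del a, b
pool.free_all_blocks()                                  # return cached blocks to the driver
print(f"after free_all_blocks: {pool.total_bytes() / 2**20:.1f} MiB held")
```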

### Paid GPU: NVIDIA A100 40 GB (Ampere, CC 8.0, HBM2e)

| Aspect | CuPy | CUDA Python | Julia CUDA |
| --- | --- | --- | --- |
| Wheel / package | The same cupy-cuda12x covers SM 80 | Numba ≥ 0.57 supports SM 80 | CUDA.jl auto-detects the arch |
| Precision extras | Enable TF32: `cp.cuda.set_matmul_precision("tf32")` | `@cuda.jit(fastmath=True)` for TF32 tensor cores | `allow_tf32!()` |
| Memory / BW | 40 GB HBM2e, 1.6 TB s⁻¹ | Same | Same |
| Perf gain vs T4 | 10–15× on GEMM / conv | Similar after occupancy tuning | Similar |
| Session cost | ≈ $1.18 hr⁻¹ (11.8 CU hr⁻¹) | Same | Same |
| Caveats | Limited availability; CUs burn fast | Kernel launch + compile time higher | Initial package compile time |
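To check what a given card or precision setting actually buys, `cupyx.profiler.benchmark` reports CPU and GPU times for a callable. Below is a minimal, illustrative sketch (not from the notebooks) that times a large single-precision GEMM; the `CUPY_TF32` environment variable mentioned in the comment is an assumption on my part and worth verifying against the CuPy docs for your version.

```python
# Sketch: timing a large GEMM on the A100 with cupyx.profiler.benchmark.
# Assumption to verify: setting CUPY_TF32=1 in the environment *before* importing
# CuPy enables TF32 tensor-core math for float32 GEMM on Ampere-class GPUs.
import cupy as cp
from cupyx.profiler import benchmark

a = cp.random.random((8192, 8192), dtype=cp.float32)
b = cp.random.random((8192, 8192), dtype=cp.float32)

def gemm():
    return a @ b

# n_repeat timed launches after n_warmup warm-up runs; prints CPU and GPU times.
print(benchmark(gemm, n_repeat=20, n_warmup=3))
```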

## Quick Recommendations

- Start with CuPy for anything expressible as NumPy/SciPy: lowest friction, high speed.
- Use CUDA Python only for the hotspots that need bespoke parallel patterns; stay on the latest Numba.
- Prefer Julia CUDA if your workflow is already in Julia: performance parity with cleaner syntax.
- Choose the GPU by memory and budget: the free T4 for prototyping, the L4 for moderate models with FP8, and the A100 when you need 40 GB or TF32 accuracy. (A snippet for checking which GPU your session received follows below.)

Yes = works out-of-the-box; No = requires explicit install / setup.
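To see which GPU and how much memory a given Colab session actually handed you (the basis for the last recommendation above), a quick check from inside a notebook:

```python
# Sketch: identify the attached GPU and its memory from inside a Colab notebook.
import cupy as cp

props = cp.cuda.runtime.getDeviceProperties(0)
free_b, total_b = cp.cuda.Device(0).mem_info

print("GPU:               ", props["name"].decode())   # e.g. "Tesla T4" or an L4/A100 name string
print("Compute capability:", f'{props["major"]}.{props["minor"]}')
print("Total memory:      ", f"{total_b / 2**30:.1f} GiB")
print("Free memory:       ", f"{free_b / 2**30:.1f} GiB")
```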
