When Tachyum unveiled the concept of its Prodigy Universal Processor at Hot Chips 2018, it made quite a splash with a chip designed to run any code using a dynamic binary translator. It demonstrated high performance when executing both native and translated code. It took the company a while to design the actual hardware, but it is now taking pre-orders on evaluation kits and has disclosed the exact specifications of its Prodigy. They certainly look impressive, but they are also scary, with a thermal design power of 950W per chip.

Formidable Performance at Formidable Power

Each Tachyum Prodigy processor has up to 128 proprietary cores mated with 16 DDR5 memory channels (for a 1,024-bit interface) supporting data transfer rates of up to 7200 MT/s (and therefore providing up to 921.6 GBps of bandwidth) as well as 64 PCIe 5.0 lanes. In addition, the chip supports up to 8TB of DDR5 memory in total, which is in line with what we will see with upcoming server CPUs from other makers. As for clock rates, Tachyum’s Prodigy is designed to run at up to 5.7 GHz and is made on TSMC’s performance-optimized N5P process technology.
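The bandwidth figure follows directly from the channel count, the bus width, and the transfer rate. Here is a minimal Python sketch of that arithmetic, assuming 64 data bits per DDR5 channel (which is what 1,024 bits across 16 channels implies) and decimal gigabytes. The function name is purely illustrative.

```python
def peak_memory_bandwidth_gbps(channels: int, bits_per_channel: int,
                               mega_transfers_per_s: int) -> float:
    """Return theoretical peak bandwidth in GB/s (decimal: 1 GB = 1e9 bytes)."""
    # Total bus width in bytes moved on every transfer.
    bytes_per_transfer = channels * bits_per_channel // 8
    # bytes/transfer * MT/s gives MB/s; divide by 1000 for GB/s.
    return bytes_per_transfer * mega_transfers_per_s / 1000


# Figures quoted in the article: 16 channels, 64 bits each (assumed), 7200 MT/s.
prodigy_bw = peak_memory_bandwidth_gbps(16, 64, 7200)
print(f"{prodigy_bw:.1f} GB/s")  # 921.6 GB/s
```

This is a theoretical ceiling; sustained bandwidth in real workloads will be lower.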

(Image credit: Golem.de)

When it comes to performance, Tachyum expects its flagship Prodigy T16128-AIX processor to offer up to 90 FP64 TFLOPS for HPC as well as up to 12 ‘AI PetaFLOPS’ for inference and training, presumably when running native code and consuming up to 950W (and using liquid cooling), according to specifications published by the company and at Golem.de. Meanwhile, Tachyum’s Prodigy processors can work in 2-way and 4-way configurations. To put the numbers into context, AMD’s Instinct MI250X has a peak throughput of about 96 FP64 TFLOPS for HPC at about 560W. Nvidia’s H100 SXM5, for its part, can provide up to 2 INT8/FP8 PetaOPS/PetaFLOPS for AI (up to 4 PetaOPS/PetaFLOPS with sparsity) at 700W. Yet neither compute GPU can run general-purpose workloads. And this is exactly where it gets interesting.
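To make the FP64 comparison concrete, the quoted peak throughput can be divided by the quoted power. The sketch below uses only the vendor peak figures cited in this article (90 TFLOPS at 950W for Prodigy, roughly 96 TFLOPS at about 560W for the MI250X). The helper name is hypothetical, and peak numbers say little about sustained efficiency.

```python
def gflops_per_watt(peak_tflops: float, watts: float) -> float:
    """Convert a peak TFLOPS figure and a power figure into GFLOPS per watt."""
    return peak_tflops * 1000 / watts


# (peak FP64 TFLOPS, power in watts) as quoted in the article.
chips = {
    "Tachyum Prodigy T16128-AIX": (90, 950),
    "AMD Instinct MI250X": (96, 560),
}

efficiency = {name: gflops_per_watt(t, w) for name, (t, w) in chips.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.1f} FP64 GFLOPS/W")
# Tachyum Prodigy T16128-AIX: 94.7 FP64 GFLOPS/W
# AMD Instinct MI250X: 171.4 FP64 GFLOPS/W
```

On these paper numbers the GPU is ahead on FP64 efficiency, although the comparison ignores that Prodigy is also meant to run general-purpose code.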

A New CPU Is Born

Tachyum’s Prodigy is a universal homogeneous processor packing up to 128 proprietary 64-bit VLIW cores that feature two 1024-bit vector units per core and one 4096-bit matrix unit per core. In addition, each core features a 64KB instruction cache, a 64KB data cache, 1MB L2 cache, and can utilize unused L2 caches of other cores as a victim L3 cache.
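The per-core figures above add up to sizable chip-wide totals. The following sketch sums them for a 128-core part, assuming binary units (1MB = 1,024KB). The `CoreSpec` class and `chip_totals` function are illustrative names, not anything published by Tachyum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSpec:
    """Per-core resources as quoted in the article."""
    l1i_kb: int = 64        # instruction cache
    l1d_kb: int = 64        # data cache
    l2_kb: int = 1024       # 1MB L2
    vector_units: int = 2
    vector_bits: int = 1024
    matrix_bits: int = 4096


def chip_totals(cores: int, core: CoreSpec) -> dict:
    """Aggregate per-core resources across the whole chip."""
    return {
        "l1_total_mb": cores * (core.l1i_kb + core.l1d_kb) / 1024,
        "l2_total_mb": cores * core.l2_kb / 1024,
        "vector_bits_per_core": core.vector_units * core.vector_bits,
    }


totals = chip_totals(128, CoreSpec())
print(totals)
# {'l1_total_mb': 16.0, 'l2_total_mb': 128.0, 'vector_bits_per_core': 2048}
```

The 128MB L2 total is only an upper bound on what could act as victim L3, since it assumes every other core's L2 is idle.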

(Image credit: Tachyum)

Tachyum’s VLIW cores are in-order cores, but when the compiler performs proper optimizations, they can support 4-way out-of-order issue, according to Radoslav Danilak, chief executive and co-founder of Tachyum, who spoke with Golem.de. He also re-emphasized that the Prodigy instruction set architecture can achieve very high instruction-level parallelism in software using so-called poison bits.


