Inference Performance: Good, But Missing Tensor APIs

Beyond CPU and GPU, the one aspect of the Snapdragon 855 that Qualcomm made a lot of noise about is the new Hexagon 690 accelerator block.

The new unit doubles its vector pipelines, essentially doubling performance for traditional image processing tasks as well as machine inferencing workloads. Most importantly, Qualcomm now includes a dedicated “Tensor Accelerator” block which promises to offload inferencing tasks even more effectively.

I’ve queried Qualcomm about the new Tensor Accelerator and got some interesting answers. First of all, Qualcomm isn’t willing to disclose more about the performance of this IP block; the company had advertised a total of “7 TOPS” of computing power for the platform as a whole, but it would not break this figure down and attribute it to the individual IP blocks.

What was most surprising, however, was the API situation for the new Tensor Accelerator. Unfortunately, the block will not be exposed to the NNAPI until sometime later in the year with Android Q, and for the time being the accelerator is only exposed via in-house frameworks. What this means is that none of our very limited set of “AI” benchmarks is able to actually test the Tensor block, and most of what we’re going to see in terms of results are merely improvements on the side of the Hexagon’s vector cores.

Inference Performance

We start with “AI-Benchmark”, a workload we first featured in our Mate 20 review. To quote myself:

“AI-Benchmark” is a new tool developed by Andrey Ignatov from the Computer Vision Lab at ETH Zürich in Switzerland. The new benchmark application is, as far as I’m aware, one of the first to make extensive use of Android’s new NNAPI, rather than relying on each SoC vendor’s own SDK tools and APIs. This is an important distinction from AIMark, as AI-Benchmark should be better able to represent the NN performance one can expect from an application which uses the NNAPI.

Andrey extensively documents the workloads, such as the NN models used and what their function is, and has also published a paper on his methods and findings.

One thing to keep in mind is that the NNAPI isn’t just some universal translation layer that is able to magically run a neural network model on an NPU; the API as well as the SoC vendor’s underlying driver must support the exposed functions and be able to run them on the IP block. The distinction here lies between models which use features that are to date not yet supported by the NNAPI, and thus have to fall back to a CPU implementation, and models which can be hardware accelerated and operate on quantized INT8 or FP16 data. There are also models relying on FP32 data, and here again, depending on the underlying driver, these can be run either on the CPU or, for example, on the GPU.
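The fallback behaviour described above can be pictured as a simple decision function. The following is only an illustrative sketch, not how NNAPI or any vendor driver is actually implemented: the names `Model`, `Target` and `pick_target`, the op names, and the whole-model granularity are assumptions made for the example (a real runtime can partition a model and decide per operation).

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    CPU = "cpu"
    GPU = "gpu"
    ACCELERATOR = "accelerator"  # DSP/NPU-style block (hypothetical label)


@dataclass(frozen=True)
class Model:
    name: str
    dtype: str        # "int8", "fp16" or "fp32"
    ops: frozenset    # operations the model needs


def pick_target(model, driver_ops, fp32_on_gpu):
    """Decide where a model would run under the simplified rules above."""
    # Any operation the driver does not expose forces a CPU fallback.
    if not model.ops <= driver_ops:
        return Target.CPU
    # Quantized INT8 and FP16 models can be hardware accelerated.
    if model.dtype in ("int8", "fp16"):
        return Target.ACCELERATOR
    # FP32 depends on the driver: GPU if it offers that path, else CPU.
    if model.dtype == "fp32" and fp32_on_gpu:
        return Target.GPU
    return Target.CPU


# Hypothetical driver exposing two operations.
driver_ops = frozenset({"conv2d", "relu"})
quantized = Model("quantized-net", "int8", frozenset({"conv2d", "relu"}))
exotic = Model("exotic-net", "int8", frozenset({"conv2d", "custom_op"}))
print(pick_target(quantized, driver_ops, fp32_on_gpu=True))
print(pick_target(exotic, driver_ops, fp32_on_gpu=True))
```

In this sketch the second model lands on the CPU purely because of one unsupported operation, which is the kind of driver-dependent behaviour that makes NNAPI results vary so much between chipsets.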

[Charts: AIBenchmark - 1a - The Life - CPU; AIBenchmark - 6 - Ms.Universe - CPU; AIBenchmark - 7 - Berlin Driving - CPU]

In the first set of workloads, which I’ve categorised as being run on the CPU, we see the Snapdragon 855 perform well, although not exactly extraordinarily. Performance here depends much more on the system’s scheduler and on exactly how fast the CPU is allowed to reach its maximum operating performance point, as the workload is of a short, bursty nature.

[Charts: AIBenchmark - 1c - The Life - INT8; AIBenchmark - 3 - Pioneers - INT8; AIBenchmark - 5 - Cartoons - INT8]

Moving on to the 8-bit integer quantised models, these are hardware accelerated on most devices. The Snapdragon 855’s performance here is leading in all benchmarks. In the Pioneers benchmark we more clearly see the doubling of the performance of the HVX units, as the new hardware posts inference times a little under half those of the Snapdragon 845.

The Cartoons benchmark here is interesting as it showcases the API and driver aspect of NNAPI benchmarks: the Snapdragon 855 here seems to have massively better acceleration compared to its predecessors and competing devices. It might be that Qualcomm has notably improved its drivers here and is much better able to take advantage of the hardware than it was with the previous chipset.

[Charts: AIBenchmark - 1b - The Life - FP16; AIBenchmark - 2 - Zoo - FP16; AIBenchmark - 4 - Masterpiece - FP16]

The FP16 workloads finally see some competition for Qualcomm as the Kirin’s NPU exposes support for its hardware here. Qualcomm should be running these workloads on the GPU, and here we see massive gains as the new platform’s NNAPI capability is much more mature.

[Chart: AIBenchmark - 8 - Image Enhancement - FP32]

The FP32 workload sees a similar improvement for the Snapdragon 855; here Qualcomm finally is able to take full advantage of GPU acceleration which gives the new chipset a considerable lead.

AIMark

Alongside AIBenchmark, it still might be useful to have comparisons with AIMark. Rather than using NNAPI, this benchmark uses Qualcomm’s SNPE framework for acceleration. It also gives us a rare comparison against Apple’s iPhones, where the benchmark makes use of CoreML for acceleration.

[Charts: 鲁大师 / Master Lu - AImark - VGG16; 鲁大师 / Master Lu - AImark - ResNet34; 鲁大师 / Master Lu - AImark - Inception V3]

Overall, the Snapdragon 855 is able to post 2.5-3x performance boosts over the Snapdragon 845.

At the event, Qualcomm also showcased an in-house benchmark running InceptionV3 which was accelerated by both the HVX units and the new Tensor block. Here the phone was able to achieve 148 inferences/s, which, although it may be an apples-to-oranges comparison, represents a 26% boost compared to the same model run in AIMark.
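For reference, the two figures quoted above can be turned into an implied AIMark rate and per-inference times. This is just back-of-the-envelope arithmetic on the 148 inferences/s and 26% numbers from the text; the variable and function names are made up for the example, and the derived values are not separately measured results.

```python
showcased_rate = 148.0  # inferences/s in Qualcomm's in-house demo
boost = 0.26            # stated advantage over the AIMark InceptionV3 run

# Rate implied for the AIMark run: showcased = implied * (1 + boost)
implied_aimark_rate = showcased_rate / (1.0 + boost)


def ms_per_inference(rate):
    """Convert a throughput in inferences/s to milliseconds per inference."""
    return 1000.0 / rate


print(f"implied AIMark rate: {implied_aimark_rate:.1f} inferences/s")
print(f"demo latency:        {ms_per_inference(showcased_rate):.2f} ms")
print(f"implied AIMark:      {ms_per_inference(implied_aimark_rate):.2f} ms")
```

That works out to roughly 117.5 inferences/s for the AIMark run, or about 8.5 ms per inference against about 6.8 ms in the demo.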

Overall, even though the Tensor accelerator wasn’t directly tested in today’s benchmark results, the Snapdragon 855’s inference performance is outstanding due to the overall much improved driver stack as well as the doubling of the Hexagon’s vector execution units. It will be interesting to see what vendors do with this performance and we should definitely see some exciting camera applications in the future.


132 Comments

  • cknobman - Tuesday, January 15, 2019 - link

    So better power consumption but performance wise it looks like a swing and a miss.
    Nothing too meaningful over the 845.
  • IGTrading - Tuesday, January 15, 2019 - link

    To be honest, this is good enough for me and most of us.

    I'd be happy to see Qualcomm focusing more on server CPUs and computers/notebooks running Windows on ARM chips.

    It's been something like 15 or even 20 years since coders/developers stopped worrying about optimizations, performance improvements and now they only rely on the much improvement hardware being available year after year.

    We were building optimized web pages 20 years ago, that looked good and loaded in less than 10 seconds on a 5,6 KB connection.

    Now idiots build sites where the Home Page is 300 MB heavy and complain about mobile CPUs and mobile networks not being fast enough.
  • bji - Tuesday, January 15, 2019 - link

    "t's been something like 15 or even 20 years since coders/developers stopped worrying about optimizations, performance improvements and now they only rely on the much improvement hardware being available year after year."

    Speaking as a software developer, I will say that your statement is bullshit. I have yet to work on any product where performance wasn't considered and efforts to improve efficiency and performance weren't made.
  • bji - Tuesday, January 15, 2019 - link

    Also everything your browser does now is 10,000 times more complicated than anything that browsers did 20 years ago. All of the effort that has gone into developing these technologies didn't go nowhere. You are just making false equivalencies.

    And if a page took 5 seconds to load in 2019, let alone 10 seconds, you'd be screaming about how terrible the experience is.
  • name99 - Tuesday, January 15, 2019 - link

    It's usually the case that people talking confidently about what computers were like 20 yrs ago (especially how they were faster than today...) are in the age range from not yet born to barely five years old at the relevant time.

    Those of us who lived through those years (and better yet, who still own machines from those years) have a rather more realistic view.
  • rrinker - Wednesday, January 16, 2019 - link

    Really? What's the 'realistic' view? For background, the first computer I had regular access to was a TRS-80 Model 1 when they first came out in 1977, so I've been doing this a LONG time. Software today is a bloated mess. It's not all the programmers' fault though, there is this pressing need for more and more features in each new version - features that you're lucky if 1% of the users actually even utilize. Web pages now auto start videos on load and also link a dozen ads from sites with questionable response times. That would have been unthinkable in the days of 56k and slower dialup, and it just wasn't done. I even optimized my BBS in college - on campus we had (for the time) blazing fast 19.2k connections between all dorm rooms and the computing center, at a time when most people were lucky to have a 1200bps modem, and the really lucky ones had the new 2400s. So I set up my animated ANSI graphic signons in a way that on campus users at 19.2k would get the full experience and off campus users, connecting via the bank of 1200 baud modems we had, would get a simple plain text login. In today's world, there is a much greater speed disparity in internet connections. I have no problem with pretty much any site - but I have over 250mbps download on my home connection. Go visit family across the state - the best they can get is a DSL connection that nets about 500k on a good day on a speed test - and so many sites fail to load, or only ever partially load. But there are plenty of sites that don't try to force graphics and videos down your throat that still work fine.
    No, things weren't faster back in the day - but because the resources were more limited, both for apps running on the local computer in terms of RAM, storage, and video performance as well as external connectivity, programs had to be more efficient. Heck, the first computer I actually owned had a whole 256 bytes of RAM - to do anything I had to be VERY efficient.
  • Klinky1984 - Friday, January 18, 2019 - link

    So pay per minute slow internet, the non-standard compliance of Netscape 2.0 and IE 3.0, an internet without any video streaming, were the "good ol days"? Sorry but I remember bloated pages that took a minute plus to download or never loaded. I remember waiting 3 minutes for one single high res jpeg to download... They were not glory days. Can your 256 byte computer even handle Unicode? No way.
  • seamadan - Tuesday, January 22, 2019 - link

    I bet your pages looked REALLY good. Like REALLY REALLY good. I'm in awe and I haven't even seen them
  • Krysto - Tuesday, January 15, 2019 - link

    That bold has sailed. They've already given all the server IP on a silver platter to their forced Chinese "partner".

    That said, Snapdragon 8cx for notebooks does look quite intriguing, mainly because of its 10MB shared cache.
  • Krysto - Tuesday, January 15, 2019 - link

    boat*
