Native rendering API, or just use a cardboard engine?

Hi aryzon team!

Your efforts seem really awesome, but I am wondering how one would go about this without Unity. I have an idea for a much, much faster AR tracking algorithm, optimized around a single scan of the camera frame without any complicated computer vision post-processing, and maybe for a specialized engine built for AR from the ground up, instead of relying on something like Unity, which is great, but a dedicated engine might do even better.

What are the possibilities for this? I am quite knowledgeable about doing AR both with Unity and natively in C++, and would happily take the second path for a larger-scale, mass-market tool like Aryzon can become.

What I hope for is a simple or modular native SDK. Really I am only interested in how the rendering is done and whether you use any specific algorithms. I do not need Vuforia, nor any higher-level stuff.

Keep up the good work!
Best regards,

Hi Prenex,

We love the idea of a faster AR tracking algorithm. Can you elaborate a bit more on the functionality and accuracy of the tracker? Will it still be able to do 3D pose tracking? We had been working on our own tracker but gave it up for the time being, since other tracking engines work just as well or better. There is ongoing progress in the monocular pose estimation field, but we ran into two things:

  1. Phone cameras have shutter lag and lens distortion, which makes them harder to work with.
  2. The speed of the phone itself: dropping a frame can mean much lower accuracy.

Of course there are a lot more issues to be solved, but we don't want to discourage you too much.

ARKit (and probably ARCore) have done a great job of building a very good tracking engine for a small, select range of phones. So I see big potential for a pose tracker that works really well, at least within its supported functionality, on lower- to mid-range Android phones.

If you need help, please don't hesitate, just ask :slight_smile:


Hi Maarten!

The algorithm is a little different from the usual approach. It does not find shapes and apply a combination of CV filters; instead there is one very tricky first pass that scans the whole frame and a second pass that collects the data. These passes can run on every frame or only on some frames. There are no separate engines for finding markers and for tracking them: there is just one tracking engine, which is low-latency and should lock on immediately.

It was originally designed to track a lot of 2D markers at high speed, with a very short time to start tracking. It is up to the user to decide what to do with the results. If you want a Vuforia-like experience with 3D pose tracking, you use more than one of these small markers and triangulate. If you do not need a small marker to anchor objects, but instead want, say, to measure the exact location of the viewer in a hall, then you are better off not treating it like a Vuforia marker: place as many 2D markers in patterns as you like, and then you not only know where you are, but can maybe even map the walls around you, because you know which patterns to expect.
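To make the "several small markers plus geometry" idea concrete, here is a minimal sketch of one standard way to do that last step: recovering the camera's 3x4 projection matrix from the known world positions of several marker points with a Direct Linear Transform (DLT). This is not the unreleased tracker described above; the function names and the synthetic setup are my own illustrative assumptions, and a real system would further decompose P into rotation and translation.

```python
import numpy as np

def estimate_projection_dlt(world_pts, image_pts):
    """Direct Linear Transform: recover the 3x4 projection matrix P
    (up to scale) from >= 6 non-coplanar 3D points, e.g. the known
    positions of several small markers, and their 2D detections.
    Each correspondence contributes two rows to the system A p = 0."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # The least-squares null vector of A is the right singular vector
    # belonging to the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)

def project(P, world_pts):
    """Project homogeneous world points through P and dehomogenize."""
    homo = np.hstack([world_pts, np.ones((len(world_pts), 1))])
    proj = (P @ homo.T).T
    return proj[:, :2] / proj[:, 2:]
```

With six or more marker detections that are not all on one plane, this pins down the full pose; for markers on a single wall (one plane) a homography-based method would be used instead.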

I am not sure yet how much information can be encoded in the marker itself, but it seems quite limited with my current approach. When more information is needed, you combine several markers. The markers I am talking about here look very simple and should work even when they are small.


Strengths:

  • Scales well to a lot of markers, and I am not talking about tracking 10 of them simultaneously, but much more :wink:
  • There is no separate recognizer and tracker. The tracker also recognizes, with very low (close to zero) latency between first seeing a marker and starting to track it.
  • The markers look really simple and should be easy to track well, with good quality.
  • It needs few resources, thanks to simple low-level operations, few passes, and good asymptotic runtime.
  • Scales well to tracking whole walls, and maybe tracking movement across a big (more than one room) environment should be possible.


Weaknesses:

  • To encode data in a tracked pose, usually more than one marker is needed.
  • Pose estimation also needs more than one marker, plus geometry calculations.
  • I have not done any lens distortion handling yet; methods could be adapted from open-source projects like ARToolKit5. I think this can be a fast or even constant-time operation when done well, but I have never tried it…
  • It is not done yet :smiley: I think this is the worst of the weaknesses, as most of the information provided here is speculation out of my head about how it will work.
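On the lens distortion point: one cheap option, in keeping with the few-passes, low-resources goal, is to leave the frame untouched and undistort only the detected marker points, which is a handful of operations per marker. Below is a minimal sketch assuming a simple Brown-Conrady radial model with two coefficients; the model choice, parameter names, and fixed-point inversion are my own illustrative assumptions, not anything from the tracker described above.

```python
import numpy as np

def undistort_points(pts, k1, k2, fx, fy, cx, cy, iterations=8):
    """Invert an assumed Brown-Conrady radial distortion model,
    x_d = x_u * (1 + k1*r^2 + k2*r^4) in normalized coordinates,
    by fixed-point iteration. pts is an (N, 2) array of pixel
    coordinates; returns undistorted pixel coordinates."""
    # Move from pixels to normalized camera coordinates.
    x = (pts[:, 0] - cx) / fx
    y = (pts[:, 1] - cy) / fy
    xu, yu = x.copy(), y.copy()
    for _ in range(iterations):
        # Re-estimate the distortion factor at the current guess
        # and divide it out; converges quickly for mild distortion.
        r2 = xu * xu + yu * yu
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        xu, yu = x / factor, y / factor
    # Back to pixel coordinates.
    return np.stack([xu * fx + cx, yu * fy + cy], axis=1)
```

The coefficients themselves would come from a one-time calibration (ARToolKit5 and OpenCV both ship calibration tools that estimate this kind of model); at runtime only this cheap inverse runs, per detected point.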

I hope this answers some of your questions. Could you tell me (and possibly others) what kind of native support we could get? It would be awesome to start experimenting.

PS: One more weakness is that it is not an NFT tracker. It strictly tracks the special markers and cannot learn to track the features of an arbitrary image or brochure. If you want to apply it to a brochure, you put the ad content, like photos and text, in the middle, and 3-4 of the small markers around that business content at the edges of the paper for the pose tracking. Also, the virtual-button-like feature you might know from Vuforia can only work with the special markers, if that is needed at all; it has not been much of a focus for me. Surely it can be done, though, if needed.

I know it seems like a lot of constraints, but they are needed to make the whole thing fast.

PS: One more strength, however, is that I am trying to make this as stand-alone as possible. I hate when I need to add a whole OpenCV subsystem and compile a lot of customized computer vision libraries just to use a simple tracker. I have worked with Vuforia and ARTK5, though, and I think both even have hand-written assembly optimizations for ARM, so it is not the details but the algorithm that has to change: both do a lot of things under the hood, and it is no wonder they need a strong phone or otherwise eat most of its resources.