It unsettled me just how much work went into making the JavaScript version work instead of a purely Python version, because of how OpenCV behaves. I wonder how universal the laggy OpenCV thing is, because my friend faced it too when working on an OpenCV application. Is it so unavoidable that the only option is to not use Python? I really hope that there is another way of going about this.
Anyways, I am very glad that you put in all that effort to make the JavaScript version work well. Working under limitations is sometimes cool. I remember having to figure out how PyTorch evaluated neural networks, and having to convert the PyTorch neural network into Java code that could evaluate the model without any external libraries (it was very inefficient) for a Java code competition. Although there may have been a better way, what I did was good enough.
Creating a faster Python implementation can definitely be done. OpenCV's Python bindings are a thin wrapper over the C++ API, so it's not due to some intrinsic Python slowness. It is not easy to resolve though, and I suspect the way Python code is typically written lends itself to an accidentally blocking operation more often than JS code. It's hard to know without seeing the code.
author here, sorry you had to see my janky JavaScript solution XD but one good thing about going with Tauri is that developing the UI is pretty easy, since it's basically just some web pages, but with access to the system through the JS <-> Rust communication.
also, rewriting a neural network from PyTorch to Java sounds like a big task, I wonder if people are doing ML in Java
This is cool, but a moving average filter is pretty bad at removing noise - it tends to be longer than it needs to be because its passband is so bad. Try using an IIR filter instead. You don't need to deal with calculating the coefficients correctly because they'll just be empirically determined.
out = last_out * x + input * (1-x)
Where x is between zero and one. The closer x is to one, the more filtering you'll do. You can cascade these too, to make a higher-order filter, which will work even better.
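For anyone who wants to try it, here is a minimal Python sketch of that formula plus the cascading idea. The class names `ExpFilter` and `CascadedFilter` are made up for illustration, and seeding the filter with the first sample is my own assumption, not something from the comment above.

```python
class ExpFilter:
    """First-order IIR filter: out = last_out * x + input * (1 - x)."""

    def __init__(self, x: float):
        if not 0.0 <= x < 1.0:
            raise ValueError("x must be in [0, 1)")
        self.x = x
        self.last_out = None  # no output produced yet

    def update(self, value: float) -> float:
        if self.last_out is None:
            # Assumption: seed with the first sample so the output
            # does not ramp up from zero.
            self.last_out = value
        else:
            self.last_out = self.last_out * self.x + value * (1.0 - self.x)
        return self.last_out


class CascadedFilter:
    """Several identical first-order stages in series (a higher-order filter)."""

    def __init__(self, x: float, stages: int = 2):
        self.stages = [ExpFilter(x) for _ in range(stages)]

    def update(self, value: float) -> float:
        # Feed the output of each stage into the next one.
        for stage in self.stages:
            value = stage.update(value)
        return value
```

You would keep one filter per axis (one for cursor x, one for cursor y) and tune x by feel, as the comment suggests.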
I've heard good things about using the 1 euro filter for user-input-related tasks, where you're trying to effectively remove noise but also keep latency down.
see https://gery.casiez.net/1euro/ with plenty of existing implementations to pick from
That sounds very interesting. I've been needing a filter to deal with noisy A/D conversions for pots in an audio project. Noise on a volume control turns into noise on the output, and sounds horrible, but excessive filtering causes unpleasant latency when using the dials.
Interesting, never heard of the IIR filter before, will keep it in mind as one of the options if I ever work on removing noise again, thanks for sharing!
You are already using the IIR filter as part of the one-euro filter. The 1€ filter is an adaptive filter that uses a first-order IIR, also called an exponential filter, as its basis. Depending on your filtering parameters you can turn off the adaptive part and you are left with just the IIR.
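To make that relationship concrete, here is a rough Python sketch of the 1€ filter as I understand it. The parameter names `min_cutoff`, `beta`, and `d_cutoff` are my own spelling of the usual tuning knobs, and the defaults are placeholders, not recommended values. Note that `alpha` here weights the new sample, so it corresponds to `1 - x` in the formula upthread.

```python
import math


def smoothing_factor(dt: float, cutoff: float) -> float:
    """Weight of the new sample for a given cutoff frequency (Hz) and time step (s)."""
    r = 2.0 * math.pi * cutoff * dt
    return r / (r + 1.0)


class OneEuroFilter:
    def __init__(self, min_cutoff: float = 1.0, beta: float = 0.0, d_cutoff: float = 1.0):
        self.min_cutoff = min_cutoff  # baseline cutoff: lower means smoother at rest
        self.beta = beta              # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff      # cutoff used to smooth the speed estimate
        self.x_prev = None
        self.dx_prev = 0.0
        self.t_prev = None

    def update(self, t: float, x: float) -> float:
        if self.x_prev is None:
            self.x_prev, self.t_prev = x, t
            return x
        dt = t - self.t_prev
        if dt <= 0.0:
            return self.x_prev  # ignore out-of-order or duplicate timestamps
        # Smoothed estimate of how fast the signal is moving.
        a_d = smoothing_factor(dt, self.d_cutoff)
        dx = (x - self.x_prev) / dt
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Adaptive part: fast movement raises the cutoff, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = smoothing_factor(dt, cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev, self.t_prev = x_hat, dx_hat, t
        return x_hat
```

With `beta=0` the cutoff never changes, so what remains is exactly the plain exponential filter.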
Mediapipe is a lot of fun to play with and I'm surprised how little it seems to be used.
You might also be interested in Project Gameface, open source Windows and Android software for face input: https://github.com/google/project-gameface
Also https://github.com/takeyamayuki/NonMouse
Probably because the API is written like enterprise Java garbage
Amazing work! I have been working on robotifying an operations task for my company: a robot hand and a vision system that can complete a task on the monitor just like humans do. I have been toying with an OpenAI vision model to get the mouse coordinates, but it's slow and does not always return the correct coordinates (probably due to the LLM not understanding geometry).
Anyhow, looking forward to trying your approach with MediaPipe. Thanks for the write-up and demo, inspirational.
I did a very similar project a few months back. My goal was to help alleviate some of the RSI issues I have, and give myself a different input device.
The precision was always tricky, and while fun, I eventually abandoned the project and switched to face tracking and blinking so I didn't have to hold up my hand.
For some reason the idea of pointing my webcam down never dawned on me. I then discovered Project Gameface and just started using that.
Happy programming, and thank you for the excellent write-up!
Glad you enjoyed reading it! I just checked the Project Gameface demo [1], and it's really cool that it is accurate enough for drawing text. I wonder what it is tracking. Are you still using it?
[1] https://blog.google/technology/ai/google-project-gameface/
I'm curious how your experience is using Gameface for day-to-day tasks like coding. I assume you still use a keyboard for typing, but what about selecting blocks of text or general navigation?
Similar situation here, super interested in hearing how well gameface works for you. Do you use it for non-gaming as well?
I've succeeded in fully replacing the keyboard (I use Talon voice) but find replacing the mouse tougher. Tried eyetracking but could never get it accurate enough not to be frustrating.
Very nice! The sort of thing that I expect to see on HN. Do you currently use it? I mean, maybe it is not perfect as a mouse replacement, but using it as a remote movie control, as shown in one of the last videos, is definitely a legit use case. Congrats!
I'm glad it is up to the HN standard :) No, I don't currently use it, I am back on mouse and touchpad, but I can definitely see what you mean by remote movie control. I would love to control my movie projector with my hand.
I've been thinking on and off about how to improve the forward-facing mode. Having the hand straight ahead of the camera messes with the readings; I think MediaPipe is trained on seeing the hand from above or below (and maybe the sides) but not straight ahead.
Ideally, the camera should be somewhat above the hand (pointing downwards) to get the best results. But in the current version of downward-facing mode, the way to move the cursor is actually by moving the hand around (the x and y position of the hand translates to the x and y of the cursor). If the camera FOV is very big (capturing from far away), then you would have to move your hand very far in order to move the cursor, which is probably not ideal.
I later got an idea for improving this when playing around with a smart TV, where the remote controls a cursor. You do that by tilting the remote up and down or left and right; I think it uses a gyroscope or an accelerometer (idk which is which). I wish I had a video of it to show it better, but I don't. I think it is possible to apply the same concept here to the hand tracking, so we use the tilt of the hand to control the cursor. This way, we don't have to rely on the hand position captured by the camera. Plus, this will work if the camera is far away, since it is only detecting the hand tilt. Still thinking about this.
Anyway, I'm glad you find the article interesting!
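A very rough Python sketch of what that tilt idea might look like, purely as a thought experiment. Everything here is an assumption: that each landmark is an (x, y, z) tuple with x to the right, y downwards and smaller z meaning closer to the camera, that the wrist-to-knuckle direction is a usable proxy for tilt, and that tilt drives cursor velocity like a joystick. The function names, `gain`, and `deadzone` are hypothetical.

```python
import math


def hand_tilt(wrist, knuckle):
    """Return (yaw, pitch) in radians for the wrist -> knuckle direction."""
    dx = knuckle[0] - wrist[0]
    dy = knuckle[1] - wrist[1]
    dz = knuckle[2] - wrist[2]
    # Yaw is 0 when the hand points "up" in the image, positive when turned right.
    yaw = math.atan2(dx, -dy)
    # Pitch is positive when the knuckle is closer to the camera than the wrist.
    pitch = math.atan2(-dz, math.hypot(dx, dy))
    return yaw, pitch


def apply_deadzone(angle, deadzone):
    """Ignore small angles so a resting hand does not move the cursor."""
    if abs(angle) <= deadzone:
        return 0.0
    return math.copysign(abs(angle) - deadzone, angle)


def tilt_to_velocity(wrist, knuckle, neutral=(0.0, 0.0),
                     gain=800.0, deadzone=math.radians(5)):
    """Map hand tilt to a cursor velocity in pixels per second (screen y grows downwards)."""
    yaw, pitch = hand_tilt(wrist, knuckle)
    vx = gain * apply_deadzone(yaw - neutral[0], deadzone)
    vy = -gain * apply_deadzone(pitch - neutral[1], deadzone)
    return vx, vy
```

Because only the direction between two landmarks is used, the result does not depend on where the hand is in the frame, which is the property the comment is after. Whether the depth estimate is stable enough for the pitch part is something I have not tested.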
I tried to implement Johnny Lee's amazing idea (https://www.youtube.com/watch?v=Jd3-eiid-Uw) using MediaPipe face tracking. I could not get very far using simple webcams, since it was difficult to determine the distance of the face from the camera when the face was turned. I had an Intel RealSense 415 depth tracking camera from a different project and it took care of the distance thing at least. But the jitter thing had me stumped for a long time and I put the project away. With your ideas, I have the strength to revisit it. Thanks!
> Python version is super laggy, something to do with OpenCV
Most probably I'm wrong, but I wonder if it has anything to do with all the text being written to stdout. In the odd chance that it happens on the same thread, it might be blocking.
Hmm, I can't remember if I tried it without the text being written to stdout. But that's an interesting point, I just didn't expect the print() blocking to be significant.
Could it then be resolved by using the no-gil version of python they just released?
I’m not sure what your reasoning is, but note that blocking I/O including print() releases the GIL. (So your seemingly innocent debugging print can be extremely not harmless under the wrong circumstances.)
A great demo, but how I wish there was a keyboard-less method for word input based on swipe-typing, meaning I do not press virtual keys, I just wave my index finger in the air, and the vision picks up the path traces and converts them into words. Well, if there's something else asking for even less effort, maybe even something that's already implemented - I am all open to suggestions!
Some problems in life can be easily fixed with crimson red nail polish.
That made me smirk, but I am curious, "What would be the best color for general webcam colored-object tracking?" I'm sure it would depend on the sensor, but I wonder if one color would be best for the most basic hardware.
Something not found in the background. If you can cleanly segment the image purely on color, that makes the object tracking very very easy if you're tracking a single object.
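As a toy illustration of "segment purely on color", here is a Python sketch that finds the centroid of strongly red pixels. It is pure standard library so it runs anywhere; a real pipeline would do the same thresholding with a vectorized image library. The thresholds and the names `is_target` and `color_centroid` are made up for this example.

```python
import colorsys


def is_target(rgb, hue_center=0.0, hue_tol=0.05, min_sat=0.6, min_val=0.4):
    """True if an (R, G, B) pixel in 0..255 is close to the target hue and vivid enough."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # Hue is circular, so measure the distance around the wheel (red sits at 0 and 1).
    dh = abs(h - hue_center)
    dh = min(dh, 1.0 - dh)
    return dh <= hue_tol and s >= min_sat and v >= min_val


def color_centroid(image):
    """Return the (x, y) centroid of matching pixels, or None if nothing matches.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    """
    sum_x = sum_y = count = 0
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if is_target(pixel):
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

This only works as well as the "not found in the background" condition holds: any other vivid red object in view pulls the centroid away from the one you want.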
Mediapipe makes hand tracking so easy and it looks SO cool. I did a demo at PyData NYC a couple of years ago that let you rotate a Plotly 3D plot using your hand:
https://youtu.be/ijRBbtT2tgc?si=2jhYLONw0nCNfs65&t=1453
Source: https://github.com/jcheng5/brownian
That demo is pretty impressive!
This is a very cool demo! Well done!
One suggestion for fixing the cursor drift during finger taps: instead of using the hand position, use the index finger, then tap the middle finger to the thumb for selection. This doesn't change the cursor position, yet it is still a comfortable and easy-to-parse action.
Thanks Herman, glad you enjoyed it! I agree with your suggestion, having the middle finger + thumb for tap and index finger for the movement will mitigate the cursor drift. The only reason I used index finger + thumb is so that it is like the Apple Vision Pro input. But definitely could be an improvement.
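A minimal Python sketch of that split, in case it helps. It assumes the hand arrives as a list of 21 (x, y) points using MediaPipe's hand landmark numbering as I understand it (0 wrist, 4 thumb tip, 8 index tip, 9 middle-finger knuckle, 12 middle-finger tip). The `PinchDetector` name and both thresholds are invented, and the two-threshold hysteresis is my own addition to avoid flickering clicks.

```python
import math

WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_MCP, MIDDLE_TIP = 0, 4, 8, 9, 12


class PinchDetector:
    """Cursor follows the index tip; middle finger + thumb pinch acts as the click."""

    def __init__(self, press=0.25, release=0.35):
        self.press = press      # pinch starts below this normalized gap
        self.release = release  # pinch ends above this normalized gap
        self.pressed = False

    def update(self, landmarks):
        cursor = landmarks[INDEX_TIP][:2]
        # Normalize by hand size so the thresholds do not depend on camera distance.
        scale = math.dist(landmarks[WRIST], landmarks[MIDDLE_MCP])
        if scale == 0.0:
            return cursor, self.pressed
        gap = math.dist(landmarks[THUMB_TIP], landmarks[MIDDLE_TIP]) / scale
        if self.pressed and gap > self.release:
            self.pressed = False
        elif not self.pressed and gap < self.press:
            self.pressed = True
        return cursor, self.pressed
```

The index tip may still move a little when the middle finger bends, so this reduces the drift rather than eliminating it.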
Unrelated, but shoutout to bearblog. My first blog was on bearblog, which made me start writing. Although I later ended up self-hosting my own blog.
Such a cool and inspirational project! Regarding the drift on pinch, have you tried storing the pointer positions from the last second and using an earlier one as the click position? You could show this position as a second cursor maybe? I've always wondered why Apple doesn't do this for their "eye moves faster than hands" issue as well.
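One way to sketch that in Python: keep a short timestamped history of pointer positions and, when a pinch is detected, click where the pointer was a moment earlier. The `PointerHistory` name, the one-second window, and the `lookback` value are all assumptions for illustration.

```python
from collections import deque


class PointerHistory:
    """Remembers recent pointer positions so a click can use a pre-pinch position."""

    def __init__(self, window=1.0):
        self.window = window    # seconds of history to keep
        self.samples = deque()  # (timestamp, position) pairs, oldest first

    def record(self, t, pos):
        self.samples.append((t, pos))
        # Drop samples that have fallen out of the time window.
        while self.samples and t - self.samples[0][0] > self.window:
            self.samples.popleft()

    def position_before(self, t, lookback):
        """Return the latest position recorded at or before t - lookback.

        Falls back to the oldest stored sample if the history is too short,
        and to None if nothing has been recorded.
        """
        if not self.samples:
            return None
        target = t - lookback
        best = self.samples[0][1]
        for ts, pos in self.samples:
            if ts <= target:
                best = pos
            else:
                break
        return best
```

On a pinch at time `t`, the click would go to `history.position_before(t, lookback)` instead of the current, already drifted position. The right `lookback` depends on how long the pinch gesture takes, so it would need tuning.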
Related online demo on using mediapipe for flying spaceships and camera/hand interaction to grab VR cubes (2nd link for the demo). There was a discussion on hackaday recently [2].
[0] https://tympanus.net/codrops/2024/10/24/creating-a-3d-hand-c...
[1] https://tympanus.net/Tutorials/webcam-3D-handcontrols/
[2] https://hackaday.com/2024/10/25/diy-3d-hand-controller-using... DIY 3d hand controller
It's projects like this that really make me want to start on a virtual theremin. Wish I had the time :(
My son did a basic version for a class project, surprisingly simple with MediaPipe
https://s-ocheng.github.io/theremin/
https://github.com/s-ocheng/theremin
Oh that's an awesome idea!
An inspiring project. I am looking forward to seeing some gloves connected to a VR device. I think that some cheap sensors, a bit of Bayesian modelling and a calibration step can offer proper realtime hand gesture tracking.* I am already picturing being able to type on an AR keyboard. If the gloves are more expensive there might be some haptic feedback. VR devices might have more open OSes in the future or could use a "streaming" platform to access remote desktop environments. I am eager to see all the incoming use cases!
*: a lot of it. Plus, the tracking might be task-centered. I would not bet on a general hand gesture tracking with cheap sensors and bayesian modelling only.
Tap (tapwithus.com) had an IMU-based solution early on in the current VR hype cycle, using an IMU for each finger and some kind of chord-based letter typing system. Wearing them during VR meetups back then was a fancy proof of your geekiness.
I think they have a camera-based wristband version now.
Still doesn't have any room positioning info though, AFAIK.
Cool path and write-up. Thank you!
Just because of the use case, and me not having used it in an AR app while wanting to, I'd like to point to doublepoint.com's totally different but well-working approach, where they trained a NN to interpret a Samsung Watch's IMU data to detect taps. They also added a mouse mode.
I think Google's OS also allows client BT mode for the device, so I think it can be paired directly as a HID, IIRC.
Not affiliated, but impressed by the funding they received :)
Wow interesting, reminded me of that Meta Orion wristband, I wonder if that is the goal.
Reminds me of the Leap Motion controller, now there's a version 2: https://leap2.ultraleap.com/downloads/leap-motion-controller...
So cool! I was just wondering the other day if it would be possible to build this! For front facing mode, I wonder if you could add a brief “calibration” step to help it learn the correct scale and adjust angles, e.g. give users a few targets to hit on the screen
Hi Jacob, thanks for checking it out. Regarding the calibration step for front-facing mode, I'm glad you brought this up. I did think of this, because the distance from the camera/screen to the hand affects the movement so much (the part where the angle of the hand is part of the position calculation).
And you are absolutely right regarding its use for the correct scale. For my implementation, I actually just hardcoded the calibration values, based on where I want the boundaries for the Z axis. This value I got from the reading, so in a way it's like a manual calibration. :D But having calibration is definitely the right idea, I just didn't want to overcomplicate things at that time.
BTW, I am a happy user of Exponent, thanks for making it! I am doing some courses and also peer mocks for interview prep!
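For what the "few targets" calibration could look like, here is a small Python sketch: the user points at targets with known screen positions, and a per-axis scale and offset are fitted by least squares. The function names and the purely linear per-axis model are assumptions; the hardcoded values described above are not reproduced here.

```python
def fit_axis(raw, screen):
    """Least-squares fit of screen = scale * raw + offset for one axis."""
    n = len(raw)
    if n < 2 or n != len(screen):
        raise ValueError("need at least two matching samples")
    mean_r = sum(raw) / n
    mean_s = sum(screen) / n
    var = sum((r - mean_r) ** 2 for r in raw)
    if var == 0.0:
        raise ValueError("calibration targets must differ along this axis")
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(raw, screen))
    scale = cov / var
    offset = mean_s - scale * mean_r
    return scale, offset


def to_screen(raw_value, scale, offset, size):
    """Apply the fitted mapping and clamp to the valid pixel range [0, size - 1]."""
    value = scale * raw_value + offset
    return min(max(value, 0.0), size - 1)
```

For example, if the hand reads 0.2, 0.5 and 0.8 while pointing at x = 0, 960 and 1920 on screen, the fit gives a scale of about 3200 and an offset of about -640. The same fit would be run separately for the y axis.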
If compelling enough I don't mind setting up a downward facing camera. Would like to see some more examples though where it shows some supremacy over just using a mouse. I'm sure there are some scenarios where it is.
This has tons of potential in the creative technology space. Thanks for sharing!
This is very cool - can you do window focus based on the window I am looking at next? :)
Man, I feel making diagrams / writing handwritten notes will be great with this!
Very impressive! This opens up a whole new set of usages for this headset
erm i snuk in hackers news i kid erm what the sigmwa
could this be the next evolution of gaming mice?