/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality.


Robot Vision General Robowaifu Technician 09/11/2019 (Wed) 01:13:09 No.97
Cameras, Lenses, Actuators, Control Systems

Unless you want to deck out your waifubot in dark glasses and a white cane, learning about vision systems is a good idea. Please post resources here.

opencv.org/
https://archive.is/7dFuu

github.com/opencv/opencv
https://archive.is/PEFzq

www.robotshop.com/en/cameras-vision-sensors.html
https://archive.is/7ESmt
Edited last time by Chobitsu on 09/11/2019 (Wed) 01:14:45.
>>10995 Thank you kindly Anon.
Open file (1.05 MB 1758x504 modulo_teaser.png)
I already mentioned there's open source software that can see heartbeats and micro facial movements called Eulerian Video Magnification in another thread, but I figure I might as well mention it here too: https://www.youtube.com/watch?v=ONZcjs1Pjmk As for hardware, I was thinking simple solid black camera eyes if I couldn't get 3-axis movement (vertical, horizontal, convergence) working with cameras without taking up too much space. I haven't thought too much about eyes other than that, except that I was thinking she should see in IR, to help in poor lighting without blinding me with LEDs on her face. I remember reading about something called a Modulo Camera that supposedly never over-exposes or something, so maybe it could just use a bigger camera for better night vision? There's also something called a "Light field camera" that keeps everything in focus, but I'm not sure how useful that is for robot vision, I just think it's neat.
>>13163 That's an interesting concept Anon, thanks. Yes, I think cameras and image analysis have very long legs yet, and we still have several orders of magnitude improvements yet to come in the future. It would be nice if our robowaifus (and not just our enemies) can take advantage of this for us. We need to really be thinking ahead in this area tbh.
It seems like CMOS is the default sensor for most CV applications due to cost. But seeing all these beautiful eye designs makes me consider carefully how those photons get processed into signal for the robowaifus. Cost aside, CCD as a technology seems better because the entire image is processed monolithically, as one crisp frame, instead of a huge array of individual pixel sensors, which I think causes noise that has to be dealt with in image post-processing. CCD looks like it's still the go-to for scientific instruments today. In astrophotography everyone drools over cameras with CCD; CMOS is -ok- and fits most amateur needs, but the pros use CCD. Astrophotography / scientific: www.atik-cameras(dot)com/news/difference-between-ccd-cmos-sensors/ This article breaks it down pretty well from a strictly CV standpoint: www.adimec(dot)com/ccd-vs-cmos-image-sensors-in-machine-vision-cameras/
>>14751 That looks very cool Anon. I think you're right about CCDs being very good sensor tech. Certainly I think that if we can find ones that suit our specific mobile robowaifu design needs, then that would certainly be a great choice. Thanks for the post!
iLab Neuromorphic Vision C++ Toolkit The USC iLab is headed up by the PhD behind the Jevois cameras and systems. http://ilab.usc.edu/toolkit/
>(>>15997, ... loosely related)
>"Follow Me" eyes (crosslink): >>19037 - I somehow forgot that we had a dedicated thread for eyes.
> conversation-related (>>23398, ...)
Related: >>23405 >One thing I would like to do with a board that allows for more than one camera, would be to have a way to use this for creating a somewhat 3D model of the world. Especially be able to know the distance of an object it recognizes. This will be absolutely crucial to understand the world. >>23410 >Stereo Depth Cameras ... using triangulation >>23431 > auto-mesh generation Is this about generating meshes from 2D pictures? I just wrote somewhere that I wonder how video to 3D model would work. It's possible to use AI generated videos to feed a game engine and render an even better video. I guess the background is done using this "auto-mesh generation" then (pose estimation to bone model for characters).
>>23431 >Motion isn't req'd. when you already know the dimensions of the object, you can only use a single image when you already know the actual height then its just a matter of measuring the difference between the real height vs the image height, its how snipers have to figure out distances in their scope when they dont have a rangefinder, it would need to keep a database of dimensions for known objects otherwise it has to go into pajeetmode to emulate stereoscopic vision, its doable but it seems like too much hassle when you can just use two cameras
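The scope trick described above is the pinhole camera relation: distance = focal length (in pixels) x real height / height in the image (in pixels). Below is a minimal sketch, assuming an idealized pinhole camera, an upright object seen roughly face-on, and a focal length already known from calibration. The function name and the numbers are made up for illustration.

```python
def distance_from_known_height(focal_length_px, real_height_m, image_height_px):
    """Estimate the distance (metres) to an object of known height, pinhole model."""
    if image_height_px <= 0:
        raise ValueError("image_height_px must be positive")
    return focal_length_px * real_height_m / image_height_px


# Hypothetical numbers: 800 px focal length, a 2.0 m tall door spanning 400 px.
door_distance_m = distance_from_known_height(800.0, 2.0, 400.0)
print(door_distance_m)  # 4.0
```

If the object is tilted or rotated, its apparent height shrinks and this estimate comes out too far, which is the weakness raised later in the thread.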
>>23436 >it would need to keep a database of dimensions for known objects otherwise it has to go into pajeetmode to emulate stereoscopic vision Agreed, and that's an aspect of the 'well-calibrated' camera(s) part of the equation. For instance, when a robowaifu can remain in the relative safety of her master's home, then she can have the luxury of perfectly pre-learning basically everything in his space. This is a big win for all of her on-the-fly, object recognition/distance/volume/kinematic/mass/force/pose -estimate calculations. Including him, of course. :^) >"Master!? Have you been putting on weight again?" >=== -prose edit, fmt -add funpost spoilers
Edited last time by Chobitsu on 06/24/2023 (Sat) 23:34:03.
>>23435 >Is this about generating meshes from 2D pictures. Yes. It works far better using a combination of stereo depth cameras, and the ability to proactively transform the camera(s) around the object(s) in question. Much like a robowaifu (or a human photographer) would be able to do. The primary point being to highly-accurately model the world around her, including her own master and other humans. (For example: their own children romping about. :^)
>>23437 lol, i forgot it would already need a database anyway for those things, still calculating based on parallax is way simpler than comparing the image to known dimensions, especially if the object is rotated then you need to know the angle to get a real height to compare to
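For comparison, the parallax (stereo) calculation really is short once the two cameras are calibrated and rectified: depth = focal length x baseline / disparity. This is a rough sketch under those assumptions. The function name, focal length, baseline and pixel coordinates are all hypothetical, and the hard part in practice (matching the same point in both images) is not shown.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth (metres) of a point seen by two rectified, parallel cameras."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (point at infinity or bad match)")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical match: the same feature at x=420 px (left image) and x=380 px (right).
x_left_px, x_right_px = 420.0, 380.0
disparity_px = x_left_px - x_right_px

# Assumed rig: 800 px focal length, eyes 6 cm apart.
depth_m = depth_from_disparity(800.0, 0.06, disparity_px)
print(f"{depth_m:.2f} m")
```

No database of object sizes is needed here, and rotation of the object does not matter, only the pixel shift between the two views.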
>>23440 >parallax is way simpler This comment touches on a technical aspect of computation, and the up-front costs involved with setup. But fair enough Anon. I'm sure it's more reliable, in general, than simplistic dimension analysis, particularly in tricky lighting conditions.
>>23436 >need to keep a database of dimensions for known objects Thanks for pointing this out, but that's something I want to do anyways. Robowaifus should have a rough estimate on the traits of identifiable objects, e.g. weight and size. This can be taken from some public databases or LLMs. Then on top of that, if they see something close to an unknown object, which they can identify, they should be able to draw a conclusion about the size of the unknown one based on that.
Open file (97.63 KB 1202x743 yolo_nas_frontier.png)
>Deci is thrilled to announce the release of a new object detection model, YOLO-NAS - a game-changer in the world of object detection, providing superior real-time object detection capabilities and production-ready performance. Deci's mission is to provide AI teams with tools to remove development barriers and attain efficient inference performance more quickly. https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md
>>23776 what is mAP?
>>23777 The diagram indicates it's a form of accuracy to compare such models. >Mean Average Precision (mAP) https://blog.paperspace.com/mean-average-precision/ Found via: https://duckduckgo.com/?q=map+machine+learning+accuracy
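To make the mAP idea a bit more concrete, here is a simplified sketch in plain Python. A detection counts as a true positive when its box overlaps a ground-truth box enough (IoU above some threshold), average precision (AP) summarizes one class, and mAP is the mean over classes. This is a non-interpolated AP for illustration only; real benchmarks such as PASCAL VOC or COCO use interpolated variants and COCO also averages over several IoU thresholds. All function names are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_true_positive, num_ground_truth):
    """AP for one class.

    is_true_positive: detections sorted by descending confidence, True where
    the detection matched a ground-truth box (e.g. IoU >= 0.5).
    """
    hits = 0
    precision_sum = 0.0
    for rank, matched in enumerate(is_true_positive, start=1):
        if matched:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(ap_per_class):
    """mAP: plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))                 # 1 / 7
ap_cup = average_precision([True, False, True], 2)        # (1/1 + 2/3) / 2
map_score = mean_average_precision({"cup": ap_cup, "cat": 1.0})
print(round(overlap, 3), round(ap_cup, 3), round(map_score, 3))
```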
>>23776 Very low-latency in detection is vital, insofar as her autonomous safety is concerned. The ideal is human-level speed at object recognition (or even faster). We're probably getting pretty close on smol devices already, so I predict we'll reach this goal generally by the time the first real-world robowaifus begin rolling out. Thanks Anon.
>>24909
- The computers connected to the eyes (cameras) should have different ways of sharing data with other computers, e.g. just sharing body movement analysis and recognition info as a text stream, same for the person being detected, or some emotional indicators. Sending photos and videos should be very limited, only sending encrypted files, and the system should mostly not store this data. Some home server might store and process some data for fine-tuning, but it needs to receive this data encrypted. The decision about what to share should be made based on the overall context coming from the general cognitive architecture >>24783
- Fast and efficient segmentation of images (FPGAs?)
- Different variants of the same image, created very fast, maybe using an FPGA. For further processing, e.g. only processing a low-res partial image of an object to keep track of. The creation of that low-res partial image should be done by a specialized system close to the cameras.
- Using object detection models based on context informed by the general cognitive architecture >>24783 or just based on awareness of what room she's in and maybe even what she's looking at. That way they can be smaller, faster and more specialized, including some models which are trained on data specific to the household (photos and videos of the home environment).
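One possible shape for the "text stream instead of images" idea, as a rough sketch: the vision computer keeps the frames to itself and only emits labels and numbers as a JSON line. The `Detection` fields, the function name and the confidence cutoff are all assumptions for illustration; encryption and the context-based decision about what to share are not shown.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Detection:
    """Hypothetical per-object result produced by the vision computer."""
    label: str
    confidence: float
    distance_m: float


def to_text_message(detections, min_confidence=0.5):
    """Serialize only labels and numbers; no pixel data leaves this function."""
    kept = [asdict(d) for d in detections if d.confidence >= min_confidence]
    return json.dumps({"detections": kept}, sort_keys=True)


message = to_text_message([
    Detection("cup", 0.91, 0.8),
    Detection("cat", 0.32, 2.5),  # dropped: below the confidence cutoff
])
print(message)
```

A line like this is a few dozen bytes, so it can be sent to the other computers at camera frame rate without ever moving a photo over the network.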
Open file (215.89 KB 869x350 Screenshot_114.png)
Open file (326.25 KB 879x492 Screenshot_113.png)
Open file (162.74 KB 878x396 Screenshot_112.png)
>LERF optimizes a dense, multi-scale language 3D field by volume rendering CLIP embeddings along training rays, supervising these embeddings with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively in real-time. LERF enables pixel-aligned queries of the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, supporting long-tail open-vocabulary queries hierarchically across the volume. >With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance. https://www.lerf.io https://github.com/kerrj/lerf https://drive.google.com/drive/folders/1vh0mSl7v29yaGsxleadcj-LCZOE_WEWB?usp=sharing https://arxiv.org/abs/2303.09553
> Face recognition
Not tested, just looking at what's available. The following quotes are from Reddit, not from me...
https://github.com/cmusatyalab/openface
https://github.com/ageitgey/face_recognition
> I have tried this out. It's easy to code and accurately recognizes faces. The problem is it can't even detect faces 1 feet away from the camera.
https://github.com/timesler/facenet-pytorch (FaceNet & MTCNN)
> This can detect and recognize faces at a distance, but the problem is it can't recognize unknown faces correctly. I mean for unknown faces it always tries to label it as one of the faces from the model/ database encodings.
https://github.com/serengil/deepface
> I have tried VGG, ArcFace, Facenet512. The latter two gave me good results. But, the problem is I couldn't figure out how to change the detection from every 5 seconds to real-time. Also, I couldn't change the camera source. (If anyone can help me with these please do). Also, it had fps drops frequently.
https://github.com/deepinsight/insightface
> Couldn't test this yet. But in the demo YT video it shows the model incorrectly detecting a random object as a face. If someone knows how well this performs please let me know.
Source: https://www.reddit.com/r/computervision/comments/15ycwom/face_recognition_whats_the_state_of_the_art/
This one seems to be the best: https://github.com/ZoneMinder/zoneminder. The Reddit link above has a thread and a patch for detecting faces at a distance, I think.
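On the "unknown faces always get labeled as someone from the database" complaint: a common workaround is to reject a match when even the closest known face is too far away in embedding space. The sketch below uses plain Python and toy 3-D vectors. Real face embeddings have 128 or more dimensions, the 0.4 cutoff is an arbitrary assumption that would have to be tuned per model, and none of this is the API of the libraries listed above.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identify(embedding, known_faces, max_distance=0.4):
    """Return the closest known name, or "unknown" if nothing is close enough."""
    best_name, best_distance = "unknown", max_distance
    for name, reference in known_faces.items():
        distance = cosine_distance(embedding, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name


# Toy embeddings standing in for the output of a face-embedding model.
known = {"master": [1.0, 0.0, 0.0], "guest": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], known))  # master
print(identify([0.0, 0.0, 1.0], known))  # unknown
```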
Open file (537.86 KB 877x878 LLaVA.png)
LLaVA: Large Language and Vision Assistant (https://llava-vl.github.io/) A project to integrate vision into large language models. Though very new and young as a concept, adding visual context to language models has tremendous potential. Notably, a waifu which can understand correlations between what she perceives in her environment and what she is told can lead to much more natural-feeling interactions. Fingers crossed for a fork that implements YOLO (https://pjreddie.com/darknet/yolo/) rather than CLIP (https://openai.com/research/clip) for better compute and memory efficiency. Getting this to run at sub 10 watts should be a goal.
Edited last time by Kiwi_ on 10/11/2023 (Wed) 18:33:59.
Open file (82.60 KB 386x290 Screenshot_158.png)
I was working on this here >>26112 using OpenCL to make video processing faster. So I got this recommended by YouTube: https://www.youtu.be/0Kgm_aLunAo Github: https://github.com/jjmlovesgit/pipcounter This is using OpenCV to count pips on dominoes, and does it much faster and better than GPT4-Vision. I wonder if it would be possible to have an LLM adjust the code depending on the use case, maybe with a library of common patterns to look out for. Ideally one would show it something new, it would detect the outer border, like the stones here, and then adjust until it can catch the details on all of the objects which are of interest. It could look out for patterns depending on some context, e.g. a desk.
>>26132 >and does it much faster and better than GPT4-Vision. Doesn't really surprise me. OpenCV is roughly the SoA in hand-written C++ code for computer vision. You have some great posts ITT Anon thanks... keep up the good work! :^)
There are several libraries and approaches that attempt to achieve generalized object detection within a context, although creating a completely automatic, context-based object detection system without predefining objects can be a complex task due to the variability of real-world scenarios. However, libraries and methodologies that have been utilized for more general object detection include:

1. YOLO (You Only Look Once): YOLO is a popular object detection system that runs a single neural network over the whole image and can detect multiple objects in real time. However, it has to be trained on specific object categories, so it only finds the classes it was trained on.
2. OpenCV with Haar Cascades and HOG (Histogram of Oriented Gradients): OpenCV provides Haar cascades and HOG-based object detection methods. While not entirely context-based, they allow for object detection using predefined patterns and features. These methods can be more general but might not adapt well to various contexts without specific training or feature engineering.
3. TensorFlow Object Detection API: TensorFlow offers an object detection API that provides pre-trained models for various objects. While not entirely context-based, these models are designed to detect general objects and can be customized or fine-tuned for specific contexts.
4. Custom Object Detection Models with Transfer Learning: You could create a custom object detection model using transfer learning from a pre-trained model like Faster R-CNN, SSD, or Mask R-CNN. By fine-tuning on your own dataset, the model could adapt to specific contexts.
5. Generalized Shape Detection Algorithms: Libraries like scikit-image (imported as skimage) in Python provide various tools for general image processing and shape analysis, including contour detection, edge detection, and morphological operations. While not object-specific, they offer tools for identifying shapes within images.

Each of these methods has its advantages and limitations when it comes to general object detection. If you're looking for a more context-aware system that learns and adapts to various contexts, combining traditional computer vision methods with machine learning models trained on diverse images may be a step towards achieving a more generalized object detection system. However, creating a fully context-aware, automatic object detection system that adapts to any arbitrary context without any predefined objects is still a challenging area of research.

-----------------

In terms of computational requirements, here's a general ranking of the mentioned object detection methods based on the computational power and RAM they might typically require:

1. OpenCV with Haar Cascades and HOG:
 - Computational Power Needed: Low to Moderate
 - RAM Requirements: Low
 - These methods are computationally less intensive compared to deep learning-based models. They can run on systems with lower computational power and memory.
2. Generalized Shape Detection Algorithms (scikit-image):
 - Computational Power Needed: Low to Moderate
 - RAM Requirements: Low to Moderate
 - While these libraries might need slightly more computational power and RAM than Haar Cascades and HOG, they are still less demanding compared to deep learning-based models.
3. TensorFlow Object Detection API:
 - Computational Power Needed: Moderate to High
 - RAM Requirements: Moderate to High
 - Running pre-trained models from the TensorFlow Object Detection API might require more computational power and memory compared to traditional computer vision methods due to the complexity of the deep learning models.
4. Custom Object Detection Models with Transfer Learning:
 - Computational Power Needed: Moderate to High
 - RAM Requirements: Moderate to High
 - Training custom object detection models with transfer learning typically requires moderate to high computational power and memory, especially during the training phase.
5. YOLO (You Only Look Once):
 - Computational Power Needed: High
 - RAM Requirements: High
 - Full-size YOLO models are relatively demanding in terms of computational power and memory because they run a deep neural network on every frame in real time. The smaller "tiny" variants are considerably lighter.

The exact computational requirements and memory usage can vary based on the specific hardware, image sizes, complexity of the models, and the scale of the operations being performed. Deep learning models, in general, tend to demand more computational resources compared to traditional computer vision methods. If you're working with large datasets or real-time processing, more powerful hardware configurations would likely be necessary to achieve optimal performance.

--------

https://github.com/opencv/opencv/tree/master/data/haarcascades

--------

If your goal is to detect shapes without knowing the specific objects at first, OpenCV's contour detection methods combined with image processing techniques could be more appropriate than scikit-learn. Once shapes are identified, further analysis or categorization can be performed using traditional machine learning algorithms from scikit-learn or other methods.
>>26146 Understood. If the goal is to identify various objects within a specific context (like a desk) without predefining the objects, and the lighting conditions might vary, a more flexible approach using general computer vision techniques can be applied. This could involve methods such as contour detection, edge detection, and basic image processing techniques to identify objects within the context of a desk. You might use a more generalized version of object detection that isn’t specific to particular objects but rather identifies any distinguishable shape within the context. Here’s an example:

```python
import cv2

# Read the image (cv2.imread returns None if the file cannot be loaded)
image = cv2.imread('path_to_your_image.jpg')
if image is None:
    raise FileNotFoundError('Could not read path_to_your_image.jpg')

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Preprocess to enhance object edges. Blur + Otsu thresholding is one possible
# choice here; adaptive thresholding or Canny edges may suit other scenes better.
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, processed_image = cv2.threshold(
    blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Find outer contours (OpenCV 4.x returns contours and hierarchy)
contours, _ = cv2.findContours(
    processed_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Minimum contour area in pixels; an example value, tune it for your camera
min_area = 500

detected_objects = []
for contour in contours:
    # Filter objects by size; aspect ratio or other shape tests could be added
    area = cv2.contourArea(contour)
    if area > min_area:
        detected_objects.append(contour)

# Count and display the number of detected objects
print(f"Number of objects detected: {len(detected_objects)}")
```

This code applies general techniques such as contour detection to identify distinguishable shapes within the context of the desk. The process of identifying objects relies on the uniqueness of their shapes and their contrast against the background. The challenge in this approach lies in how the algorithm distinguishes objects based on their shapes and sizes. It might not identify specific objects but rather any shape that meets certain criteria (like area, aspect ratio, etc.) within the provided context (in this case, the desk). This method might detect a variety of objects but could also identify false positives or miss some objects.
Fine-tuning the conditions for object identification (like area thresholds or other characteristics) can improve the accuracy of detection within the context of the desk, considering the variability in lighting and object characteristics.
Open file (346.77 KB 696x783 1698709850406174.png)
Open file (199.93 KB 767x728 1698710469395618.png)
I suppose this is a good thread to use for discussing this concept: a swarm of small drones available for a robowaifu's use for enhanced perimeter/area surveillance, etc.
>1.6B parameter model built using SigLIP, Phi-1.5 and the LLaVA training dataset. Weights are licensed under CC-BY-SA due to using the LLaVA dataset. Try it out on Hugging Face Spaces! https://github.com/vikhyat/moondream https://huggingface.co/spaces/vikhyatk/moondream1 https://youtu.be/oDGQrOlmC1s >The model is released for research purposes only, commercial use is not allowed. >circa 6GB or 4GB quantized
>>29286 Thanks. Do you have any views on its usefulness r/n, Anon?
Open file (84.01 KB 960x720 yuina.png)
For people looking for a Kinect, I've had success finding them at electronics recycle centers. RE:PC in Seattle had a big bin. Also, I just checked and they're going for under ten dollars on eBay lol. I had also heard that the Kinect's depth camera isn't all too necessary at this point due to how good neural networks have gotten recently. Is there any merit to that?
>>29911 Unless you're using the kinect to do some sort of 3d mapping you can get stuff like pose landmark detection using AI stuff and a standard webcam, like Gulag's open-source library Mediapipe. https://mediapipe-studio.webapps.google.com/home I use some of their models for object recognition :D
>>29911 >>29915 Thanks for both the great tips, Anons! Cheers. :^)
>>29367 I think we would need workarounds if such models are not fast enough, but wow, it needs less than a second to identify common objects in a photo of a room from a home. I guess on a smaller computer it would be slower, but still. This is good enough for now, and it's just a stepping stone. Keep in mind, we don't need it as fast and general as AI in cars. The waifus will mostly look at the same home with the same objects all the time. >>29911 My issue is rather that I don't want to use a device which I can only get from recycling centers. Also, I want two cams which can move on their own, and I want to decide how far apart they are. I guess something like Kudan will be the way to go: https://www.youtube.com/@KudanLimited
A bit odd no one mentioned LiDAR. This would allow for a better sense of depth and objects behind themselves out of ordinary vision to avoid walking backwards into someone or elbowing them.
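As a rough sketch of how a 360º scan could be used for that "don't back into someone" check: take the readings from the rear half of the sweep and report the closest one inside a safety radius. The scan format (angle in degrees, distance in metres, 0º meaning straight ahead) and the function names are assumptions for illustration, not the output format of any particular LiDAR unit.

```python
import math


def nearest_obstacle_behind(scan, max_range_m=1.0):
    """Closest reading in the rear half (90 < angle < 270 deg), or None.

    scan: iterable of (angle_deg, distance_m); 0 deg is straight ahead.
    """
    rear = [
        (angle % 360.0, dist)
        for angle, dist in scan
        if 90.0 < (angle % 360.0) < 270.0 and 0.0 < dist <= max_range_m
    ]
    return min(rear, key=lambda reading: reading[1]) if rear else None


def polar_to_xy(angle_deg, distance_m):
    """Convert a reading to x (forward) / y (left) coordinates in metres."""
    rad = math.radians(angle_deg)
    return distance_m * math.cos(rad), distance_m * math.sin(rad)


# Hypothetical sweep: something 0.4 m directly behind, something else at 200 deg.
scan = [(0.0, 2.0), (90.0, 1.5), (180.0, 0.4), (200.0, 0.7), (270.0, 3.0)]
hit = nearest_obstacle_behind(scan)
print(hit)  # (180.0, 0.4)
```

Even a coarse, short-range unit would be enough for this, since all she needs is a cue to turn her head and look with the cameras.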
>>30138 but the cyberninjas wear black
>>30139 Black clothes aren't that black, especially as the dye fades over time. If you want to be picky: since it's just a secondary source of sight, you could instead use radar, at a compromise of resolution, just for general awareness, so she knows to carefully turn and see what is at a location.
>>30138 To add to my earlier point. I found a diy LiDAR that is supposed to cost $40 to make. https://www.instructables.com/Project-Lighthouse-360-Mini-Arduino-LiDAR/
>>30174 > I found a diy LiDAR that is supposed to cost $40 to make. I'd think that's a game-changer for the mapping need, if it's legit and reliable. Thanks, Anon! Cheers. :^)
>>30180 Considering the usual cost of LiDAR, I am thinking this is a bit less accurate and shorter range, but it's likely still useful for this kind of application. I'm not sure why the developer privated his videos. They might still be viewable through Archive.
>>30189 >Considering usual cost of LiDAR I am thinking this is a bit less accurate and shorter range but it's still useful for this kind of application likely. Yeah makes sense. >I'm not sure why the developer privated his videos. In my experience, that's one of the first signs that an opensource system is going closed source. They block the assets from the public b/c """reasons""". >They might be still viewable through Archive. Not sure what that means.
>>30190 >signs that an opensource system is going closed source He left up the files for making it though. Apparently his whole YouTube channel is gone. >Not sure what that means. I found the URL for at least one that was archived. The follow-up update video wasn't archived, unfortunately. https://web.archive.org/web/20210202100801/https://www.youtube.com/watch?v=uYU534Wn4lA I managed to find a similarly priced one, though a little more expensive, that used to be available as a kit, but it appears to be a different design. The website seems to no longer exist. https://web.archive.org/web/20211129020703/https://curiolighthouse.wixsite.com/lighthouse Found that one from a video of some guy assembling it: https://www.youtube.com/watch?v=_aRcoI25HqE >>30190 Going down that rabbit hole, YouTube's recommended vids led me to two others. $44, but this one is a single point instead of 360º: https://www.dfrobot.com/product-1702.html This one is $99: https://www.dfrobot.com/product-1125.html
>>30193 Wait never mind about the curiolighthouse. It seems my browser was just not redirecting to the page properly. That site is still up.
Just found out that 3D cameras for sensing depth are called a "depth camera", "3D depth sensor", or "stereoscopic depth sensor"; sometimes terms like "binocular depth camera" appear. They capture color (some IR too) and depth in a single system, like our vision works. Though if you used one of these premade units, it would mean having only head turning, not eye turning.
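What makes these units handy is that every pixel comes with a depth value, so a pixel can be turned into a 3D point with the pinhole camera model. A minimal sketch, assuming known camera intrinsics (focal lengths `fx`, `fy` and principal point `cx`, `cy` in pixels) and depth measured along the optical axis. The function name and the numbers are hypothetical, not taken from any specific sensor's SDK.

```python
def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth reading into camera coordinates.

    Returns (x, y, z) in metres: x to the right, y down, z forward.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical 640x480 sensor: the pixel 80 px right of centre reads 2.0 m.
point = pixel_to_point(u=400, v=240, depth_m=2.0,
                       fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)  # (0.32, 0.0, 2.0)
```

Doing this for every pixel gives the point cloud that mapping and mesh-generation tools start from.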
>>29915 Started on the kinect lite guide because I don't want giant XBOX 360 bars on my robot's face. And just now after saying it I regret hacking it apart. It's still huge after making it half the size, the length of a smartphone. https://medium.com/robotics-weekends/how-to-turn-old-kinect-into-a-compact-usb-powered-rgbd-sensor-f23d58e10eb0
>>30877 I know this is a stupid question, but can you strip those components right out of the support frame and have them simply connected by the wires?
>>30879 Zoom in on the hole in the centre. Looks like there is a circuit board under there. If one were to take it out of the frame, it would require adding wires and attaching them back to the circuit board, I imagine.
>>30879 >>30880 I expect the physical positioning of the 3 camera components is tightly registered. Could be recalibrated I'm sure, but it would need to be done.
>>30879 >Depth Perception
From what I know, these systems know the distance between the two cameras, and this is fixed by the hardware. If you want to do this yourself, then your system would need to know that distance. I think Kudan SLAM is software that does that: >>29937 and >>10646
>Kudan Visual SLAM
>This tutorial tells you how to run a Kudan Visual SLAM (KdVisual) system using ROS 2 bags as the input containing data of a robot exploring an area
https://amrdocs.intel.com/docs/2023.1.0/dev_guide/files/kudan-slam.html
>The Camera Basics for Visual SLAM
>“Simultaneous Localization and Mapping usually refer to a robot or a moving rigid body, equipped with a specific sensor, that estimates its motion and builds a model of the surrounding environment, without a priori information [2]. If the sensor referred to here is mainly a camera, it is called Visual SLAM.”
https://www.kudan.io/blog/camera-basics-visual-slam/
>.... ideal frame rate ... 15 fps: for applications with robots that move at a speed of 1~2m/s
>The broader the camera’s field of view, the more robust and accurate SLAM performance you can expect up to some point.
>...the larger the dynamic range is, the better the SLAM performance.
>... global shutter cameras are highly recommended for handheld, wearables, robotics, and vehicles applications.
>Baseline is the distance between the two lenses of the stereo cameras. This specification is essential for use-cases involving Stereo SLAM using stereo cameras.
>We defined Visual SLAM to use the camera as the sensor, but it can additionally fuse other sensors.
>Based on our experience, frame skip/drop, noise in images, and IR projection are typical pitfalls to watch out.
>Color image: Greyscale images suffice for most SLAM applications
>Resolution: It may not be as important as you think
>Visual SLAM: The Basics - https://www.kudan.io/archives/433
Edit: Added the tutorial and articles about "Camera Basics" and "Visual SLAM Basics".
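To illustrate why the baseline matters so much for a DIY two-camera setup: for an idealized rectified stereo pair, depth is Z = f * B / d, so a matching error of one pixel in the disparity d changes the depth estimate by roughly Z^2 / (f * B). The sketch below just evaluates that approximation. The focal length, the baselines and the one-pixel error are assumed example values, not specs of any real camera.

```python
def depth_error_m(depth_m, focal_length_px, baseline_m, disparity_error_px=1.0):
    """Approximate depth uncertainty of a rectified stereo pair.

    Derived from Z = f * B / d, so |dZ| ~= Z**2 / (f * B) * |dd|.
    """
    return depth_m ** 2 * disparity_error_px / (focal_length_px * baseline_m)


# Same 800 px focal length and 2 m target, three hypothetical eye spacings.
for baseline_m in (0.03, 0.06, 0.12):
    error_m = depth_error_m(2.0, 800.0, baseline_m)
    print(f"baseline {baseline_m:.2f} m -> about {error_m:.3f} m error at 2 m")
```

Doubling the eye spacing halves the error, while doubling the distance to the object quadruples it, which is why close-set eyes struggle with far-away objects.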
