Segment Anything but Farms: Comparing Segmentation Paradigms for Rural UAV-Captured Ultra-High-Resolution Imagery

Author/Creator ORCID

Date

Department

Program

Citation of Original Publication

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

In South Asia, where 80% of farms are smallholder plots under 0.5 hectares, a 30 cm earthen ridge ("Aali") separates two fields that appear identical in every measurable visual feature (same crop, same growth stage, same irrigation, same spectral signature), yet they represent distinct land parcels requiring individual damage attribution after flooding. This is the fundamental challenge of agricultural boundary detection: the boundaries that matter encode ownership and land tenure, not visual discontinuity. Foundation models trained on datasets that emphasize high-contrast physical edges fail catastrophically: SAM-family models (224M to 848M parameters) achieve only 35-51% Field IoU despite zero-shot building detection at 80-95% on the same dataset, and even DelineateAnything, pre-trained on 22.9 million global farm instances, reaches only 72% mAP@50 on our 4.31 cm/pixel drone imagery. We systematically document why classical computer vision, foundation models, and extensive post-processing (a 5-stage SAM2 pipeline with 4,096 prompts, watershed segmentation, and six geometric filters) cannot achieve deployment accuracy on semantic boundaries. In contrast, U-Net (MiT-B4) achieves 95.37% Field IoU, YOLOv11 reaches 92.9% mAP@50 (+20.9 pp over the 22.9M-instance baseline), and our novel panoptic formulation, which treats fields as instance "things" and the surrounding context as semantic "stuff", achieves 97% Field IoU while extracting 2,646 individual parcels, meeting the accuracy required for flood compensation and land validation. All datasets were collected during the 2025 monsoon season in Nepal's Koshi River basin (Saptari and Sunkoshi districts), capturing active flood conditions and post-event field states.
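For readers unfamiliar with the headline metric, the sketch below shows one common way Field IoU can be computed from binary field masks. The function name, NumPy-based implementation, and tile-level usage are illustrative assumptions for clarity, not the evaluation code used in this work.

```python
import numpy as np

def field_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-Union for the 'field' class.

    Both inputs are arrays of identical shape where nonzero/True marks
    pixels labeled as farm field. Illustrative sketch only; the paper's
    evaluation may differ in tiling, class handling, and averaging.
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # neither mask contains any field pixels
        return 1.0
    return float(intersection / union)

# Hypothetical usage on a single image tile:
# iou = field_iou(model_probabilities > 0.5, ground_truth_mask)
```

In the panoptic setting described above, the same per-pixel comparison applies after merging instance "thing" masks into a single field-class map against the semantic "stuff" background.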