Proceedings of the
35th European Safety and Reliability Conference (ESREL2025) and
the 33rd Society for Risk Analysis Europe Conference (SRA-E 2025)
15 – 19 June 2025, Stavanger, Norway

P2PNeXt: Advancing Crowd Counting and Localization Using an Enhanced P2PNet Architecture

Thomas Golda1,a, Jann Sänger2, John Hildenbrand1,b and Jürgen Metzler1,c

1Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, Germany.

2University of Applied Sciences Karlsruhe (HKA), Germany.

ABSTRACT

Accurate crowd counting and localization are essential for ensuring public safety and managing risks in densely populated areas, such as large events or urban environments. They enable authorities to monitor and manage large gatherings effectively, thereby preventing overcrowding and potential accidents. In emergency situations, accurate crowd data can facilitate quicker and more efficient responses by identifying high-density areas that may require immediate attention. From a computer vision perspective, these tasks demand both precise counting and accurate spatial localization of individuals. In this study, we propose an enhancement to P2PNet, a point-based framework for crowd counting, by integrating a modern neural network architecture, ConvNeXt, as the backbone. We explored two primary directions for the backbone integration: using a feature pyramid to combine several feature maps, and using a single feature map from ConvNeXt, bypassing the feature pyramid. Initial experiments indicated that the single-feature-map approach, particularly with the very first feature map, yielded superior results. However, after a few critical modifications to the feature pyramid module (bilinear interpolation for upsampling, batch normalization across convolutions, and the inclusion of ReLU in the decoder), the feature pyramid approach ultimately outperformed the single-feature-map approach. The revised feature pyramid, especially the first feature map output from the decoder module, achieved the best results across multiple datasets. In this way, our research contributes to the broader understanding of risk assessment and management, offering a robust solution for precise crowd density estimation and localization.

Keywords: Crowd Counting, Computer vision, Machine learning, ConvNeXt, P2PNet, Point-based framework, Public safety.
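
The feature pyramid modifications named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it is a dependency-free Python illustration of a single top-down merge step, in which a coarse feature map is upsampled by a factor of two with bilinear interpolation, added to a finer map, and passed through ReLU. The function names, the factor of two, and the corner-aligned sampling are assumptions made for illustration; the learned convolutions and batch normalization mentioned in the abstract are omitted.

```python
def bilinear_upsample_2x(fmap):
    """Upsample a 2D map (list of rows) by 2x with bilinear interpolation.

    Assumption: corner-aligned sampling, i.e. the corners of the output
    map coincide with the corners of the input map.
    """
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = 2 * h, 2 * w
    out = []
    for i in range(out_h):
        # Position of output row i in input coordinates
        y = i * (h - 1) / (out_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Position of output column j in input coordinates
            x = j * (w - 1) / (out_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def relu(fmap):
    """Element-wise ReLU on a 2D map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def merge_top_down(coarse, fine):
    """Hypothetical merge step: upsample the coarse map, add the fine map, apply ReLU."""
    up = bilinear_upsample_2x(coarse)
    summed = [[u + f for u, f in zip(up_row, fine_row)]
              for up_row, fine_row in zip(up, fine)]
    return relu(summed)
```

In a real network the same step would operate on multi-channel tensors, with convolution and batch normalization layers around the merge.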


