publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
2025
- Perception of Visual Content: Differences Between Humans and Foundation Models. Nardiena A. Pratama, Shaoyang Fan, and Gianluca Demartini. Proceedings of the International AAAI Conference on Web and Social Media, Jun 2025
Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundational models have been used to replace and scale up human annotators’ efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML captions and human labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structure, but all annotations are consistent in how they perceive images across regions. ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings indicate that both human and machine annotations are important and that human-generated annotations are not yet replaceable.
- Are Large Language Models Good Data Preprocessors? Elyas Meguellati, Nardiena Pratama, Shazia Sadiq, and 1 more author. In Companion Proceedings of the ACM on Web Conference 2025, Sydney NSW, Australia, Jun 2025
High-quality textual training data is essential for the success of multimodal data processing tasks, yet outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods. While recent work addressing this issue has predominantly focused on using GPT models for data preprocessing on relatively simple public datasets, there is a need to explore a broader range of Large Language Models (LLMs) and tackle more challenging and diverse datasets. In this study, we investigate the use of multiple LLMs, including LLaMA 3.1 70B, GPT-4 Turbo, and Sonnet 3.5 v2, to refine and clean the textual outputs of BLIP and GIT. We assess the impact of LLM-assisted data cleaning by comparing models for a downstream task (the SemEval 2024 subtask “Multilabel Persuasion Detection in Memes”) trained on cleaned versus non-cleaned data. While our experimental results show improvements when using LLM-cleaned captions, statistical tests reveal that most of these improvements are not significant. This suggests that while LLMs have the potential to enhance data cleaning and repair, their effectiveness may be limited depending on the context in which they are applied, the complexity of the task, and the level of noise in the text. Our findings contribute empirical evidence to the ongoing discussion about integrating LLMs into data preprocessing pipelines and highlight the need for further research into their capabilities and limitations, especially when dealing with challenging datasets.
- Fast Synthetic Data Generation for Case-Specific Entity Extraction in Criminal Investigations. Mads Skipanes, Nardiena Pratama, Kyle Porter, and 1 more author. In Proceedings of the Digital Forensics Doctoral Symposium, Jun 2025
In major criminal investigations, the manual analysis of police reports for the categorization of entities is a resource-intensive task prone to human error. Recent advances in Named Entity Recognition (NER) models offer promising solutions for automating this process, potentially reducing both time and error rates. This paper demonstrates the effectiveness of fine-tuning an NER model using a publicly shared synthetic dataset inspired by real case files. Notably, we leverage a large language model (LLM) to generate both the synthetic data and the annotations used for training. This approach enables investigators to rapidly develop case-specific models tailored to ongoing investigations. To structure this effort, we propose an ontology for entity extraction in criminal cases, focusing on key entities such as persons, seized items, communication profiles, vehicles, locations, organizations, and financial profiles. Our model achieves an average weighted F1-score of 94.2% on the synthetic dataset. For further validation, we manually annotated a small dataset of confidential data from two homicide cases, achieving an average weighted F1-score of 81.6%. Our results demonstrate that our approach can at times generalize well to real case files.