Pedro Ortiz Suarez

Cited by

	All	Since 2019
Citations	3916	3908
h-index	12	12
i10-index	15	15

Citations per year
Year	Citations
2020	233
2021	398
2022	648
2023	1557
2024	1059

Public access

10 articles available
0 articles not available

Based on funding mandates

Co-authors

Benoît Sagot	Research Director (Directeur de recherches) at Inria, head of the ALMAnaCH team	Verified email at inria.fr
Laurent Romary	Inria	Verified email at inria.fr
Yoann Dupont	Associate Professor (Maître de conférences), Sorbonne Nouvelle	Verified email at sorbonne-nouvelle.fr
Benjamin Muller	Researcher at Meta	Verified email at meta.com
Louis Martin	Facebook A.I. Research / Inria	Verified email at fb.com
Eric Villemonte De la Clergerie	INRIA	Verified email at inria.fr
Djamé Seddah	Inria (Almanach) & Université Paris Sorbonne (Paris 4)	Verified email at paris-sorbonne.fr
Julien Abadji	Research Engineer, Inria	Verified email at inria.fr
Simon Gabay	University of Geneva	Verified email at unige.ch
Rachel Bawden	Inria	Verified email at inria.fr
Philippe Gambette	Associate Professor of Computer Science, Université Gustave Eiffel	Verified email at u-pem.fr
Matthieu Futeral	PhD student, Inria Paris	Verified email at inria.fr
Alix Chagué	PhD student at Inria and Université de Montréal	Verified email at inria.fr
Luca Foppiano	National Institute for Materials Science	Verified email at nims.go.jp
Yoshihiko Takano	National Institute for Materials Science (NIMS)	Verified email at nims.go.jp
Colin Leong	University of Dayton	Verified email at udayton.edu
Daniel van Strien	Hugging Face	Verified email at huggingface.co
Angelina McMillan-Major	University of Washington	Verified email at uw.edu
Yacine Jernite	Research Scientist, HuggingFace	Verified email at cs.nyu.edu
Stella Biderman	Booz Allen Hamilton, EleutherAI	Verified email at bah.com

Pedro Ortiz Suarez

Other names: Pedro Javier Ortiz Suárez

Senior Research Scientist, Common Crawl Foundation

Verified email at commoncrawl.org

Language modeling, Corpus linguistics, Named Entity Recognition, Computational Linguistics, Machine


Title	Cited by	Year
Bloom: A 176b-parameter open-access multilingual language model. T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, R Castagné, ...	1370	2023
CamemBERT: a Tasty French Language Model. L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, ... Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020	1154	2020
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. PJ Ortiz Suárez, B Sagot, L Romary. 7th Workshop on the Challenges in the Management of Large Corpora, 2019	436*	2019
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. J Kreutzer, I Caswell, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ... Transactions of the Association for Computational Linguistics 10, 50-72, 2022	235*	2022
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. PJ Ortiz Suárez, L Romary, B Sagot. Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020	220*	2020
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. J Abadji, P Ortiz Suarez, L Romary, B Sagot. arXiv preprint arXiv:2201.06642, 2022	144	2022
The bigscience roots corpus: A 1.6 tb composite multilingual dataset. H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ... Advances in Neural Information Processing Systems 35, 31809-31826, 2022	136	2022
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus. J Abadji, PJO Suárez, L Romary, B Sagot. CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021	55	2021
Building a user-generated content north-african arabizi treebank: Tackling hell. D Seddah, F Essaidi, A Fethi, M Futeral, B Muller, PJ Ortiz Suárez, ... Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020	46	2020
Establishing a New State-of-the-Art for French Named Entity Recognition. PJ Ortiz Suárez, Y Dupont, B Muller, L Romary, B Sagot. Proceedings of The 12th Language Resources and Evaluation Conference, 4631–4638, 2020	24*	2020
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. S Gabay, P Ortiz Suarez, A Bartz, A Chagué, R Bawden, P Gambette, ... arXiv preprint arXiv:2202.09452, 2022	16	2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. A McMillan-Major, Z Alyafeai, S Biderman, K Chen, F De Toni, G Dupont, ... arXiv preprint arXiv:2201.10066, 2022	14	2022
Automatic extraction of materials and properties from superconductors scientific literature. L Foppiano, PB Castro, P Ortiz Suarez, K Terashima, Y Takano, M Ishii. Science and Technology of Advanced Materials: Methods 3 (1), 2153633, 2023	11	2023
Les modèles de langue contextuels Camembert pour le français: impact de la taille et de l'hétérogénéité des données d'entrainement. L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, E Clergerie, ... Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP …, 2020	11	2020
Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data. T Jansen, Y Tong, V Zevallos, PO Suarez. arXiv preprint arXiv:2212.10440, 2022	10	2022
Bertrade: Using contextual embeddings to parse old french. L Grobol, M Regnault, PO Suarez, B Sagot, L Romary, B Crabbé. 13th Language Resources and Evaluation Conference, 2022	8	2022
Tokenizer Choice For LLM Training: Negligible or Crucial? M Ali, M Fromm, K Thellmann, R Rutmann, M Lübbering, J Leveling, ... arXiv preprint arXiv:2310.08754, 2023	7	2023
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers. PJ Ortiz Suárez, Y Dupont, G Lejeune, T Tian. CLEF 2020 Working Notes 2696, 2020	6*	2020
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus. M Popa-Fabre, PJ Ortiz Suárez, B Sagot, ÉV de la Clergerie. Proceedings of the 8th Workshop on Challenges in the Management of Large …, 2020	3	2020
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures. M Khemakhem, I Galleron, G Williams, L Romary, PJ Ortiz Suárez	3	2019
