Materials scientists developing new functional materials for technologies like smartphones and automobiles struggle to predict material properties. Theoretical models alone are not reliable, because composition, synthesis method, and resulting properties are related in complex ways. A research team led by Dr. Yukari Katsura at Japan's National Institute for Materials Science has developed two artificial intelligence tools that use large language models (LLMs) to automate the extraction of experimental data from scientific papers, greatly speeding up the construction of materials property databases.
The tools are designed to streamline data collection for Starrydata, a materials property database launched in 2015 that previously relied on manual data extraction from papers. "Graphs in the millions of papers published to date contain valuable experimental data collected by past researchers, and much of it remains untapped," says Dr. Katsura. The research was recently published in the journal Science and Technology of Advanced Materials: Methods at https://doi.org/10.1080/27660400.2025.2590811.
The first tool, Starrydata Auto-Suggestion for Sample Information, is already integrated into the Starrydata2 web system and uses OpenAI's GPT via API to read paper text and suggest candidate entries for data fields pre-designed for specific materials domains. When users paste text from a paper's abstract or experimental methods section, the system automatically displays candidate entries in English below each input field.
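The suggestion step described above can be pictured roughly as follows. This is only a minimal sketch, not the Starrydata2 implementation: the field names, the prompt wording, and the helper functions build_prompt and parse_suggestions are all hypothetical, and the LLM call itself is replaced by a canned reply so the example runs offline.

```python
import json

# Hypothetical field list for one materials domain (assumption: the real
# Starrydata2 field definitions are not given in the article).
SAMPLE_FIELDS = ["composition", "synthesis_method", "sample_form"]


def build_prompt(paper_text, fields):
    """Assemble an instruction asking an LLM for one candidate entry per field."""
    field_list = ", ".join(fields)
    return (
        "Read the following excerpt from a materials science paper and "
        f"return a JSON object with these keys: {field_list}. "
        "Use null for any field the text does not state.\n\n"
        f"Excerpt:\n{paper_text}"
    )


def parse_suggestions(raw_reply, fields):
    """Keep only the expected fields; missing or null values become ''."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {f: ("" if data.get(f) is None else str(data[f])) for f in fields}


# Canned reply standing in for the model's answer (illustrative values only).
canned_reply = (
    '{"composition": "Bi2Te3", '
    '"synthesis_method": "spark plasma sintering", "extra": 1}'
)
suggestions = parse_suggestions(canned_reply, SAMPLE_FIELDS)
for field, value in suggestions.items():
    print(f"{field}: {value or '(no suggestion)'}")
```

Restricting the parsed reply to a fixed field list means that unexpected keys or malformed output from the model never reach the input form; the user simply sees an empty suggestion for that field.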
The second tool, Starrydata Auto-Summary GPT, takes an open-access paper PDF uploaded by the user, deconstructs it, and summarizes every description of figures, tables, and samples in the paper as structured data in JSON format. The tool was built with ChatGPT's custom GPT feature, and its output can be viewed as easy-to-read tables in a web browser. Although this data is not currently incorporated directly into the Starrydata database, it greatly speeds up data collectors' work of locating target information and entering data.
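To make the idea of viewing a JSON summary as a table concrete, here is a small sketch. The JSON layout (a "figures" list with "id" and "description" keys) and the helper to_text_table are assumptions made for illustration; the article does not specify the schema the tool actually produces, and the real tool displays its tables in a web browser rather than as plain text.

```python
import json

# Hypothetical summary layout (assumption: the actual schema is not
# described in the article). Values are illustrative only.
EXAMPLE_SUMMARY = """
{
  "figures": [
    {"id": "Fig. 1", "description": "XRD patterns of all samples"},
    {"id": "Fig. 2", "description": "Seebeck coefficient vs. temperature"}
  ]
}
"""


def to_text_table(rows, columns):
    """Render a list of dicts as a fixed-width text table."""
    # Column width = longest of the header and every cell in that column.
    widths = {
        c: max([len(c)] + [len(str(r.get(c, ""))) for r in rows])
        for c in columns
    }
    header = " | ".join(c.ljust(widths[c]) for c in columns)
    rule = "-+-".join("-" * widths[c] for c in columns)
    body = [
        " | ".join(str(r.get(c, "")).ljust(widths[c]) for c in columns)
        for r in rows
    ]
    return "\n".join([header, rule] + body)


summary = json.loads(EXAMPLE_SUMMARY)
table = to_text_table(summary["figures"], ["id", "description"])
print(table)
```

Because each figure, table, or sample becomes one row, a data collector can scan the table for the entry of interest instead of paging through the whole PDF.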
Dr. Katsura notes that many publishers prohibit the use of artificial intelligence on paper PDFs, so the team is currently developing the system for open-access papers. The tools are a significant advance because LLMs can extract information flexibly, taking background knowledge and context into account, which makes it possible to automate the conversion of complex sources such as scientific papers into structured data.
The implications for materials science and related industries are substantial. Large-scale datasets of experimental data built this way could give researchers inspiration through a bird's-eye view of the data and let them use machine learning to predict properties from empirical trends. "A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," explains Dr. Katsura.
So far, Starrydata has built up databases for specific materials science fields such as thermoelectric materials and magnets. As an open dataset usable for new materials development, it is beginning to be used by leading researchers worldwide. The team aims to raise broader awareness of the potential of large-scale experimental data and to establish data collection from papers as a recognized form of research within the scientific community. This development could significantly accelerate materials discovery cycles in industries that depend on advanced materials, from electronics to automotive to energy.


