Exploratory, semantic similarity search is becoming widespread
in digital libraries, and mathematical ones are no exception.
To support working mathematicians in their use of
digital mathematical libraries (DML) such as the
Czech Digital Mathematics Library DML-CZ [1]
or the European Digital Mathematics Library (EuDML) [2],
we have designed and implemented a math-aware similarity
computation framework based on leading-edge
topic modelling techniques implemented in the
Gensim software package [3].
The methods from studies on the classification of math papers
done for DML-CZ [4] have been
tested and deployed in EuDML, where for a given
paper the ten most semantically similar papers
are computed and shown. In our latest experiments we
evaluate several possible representations
of mathematical formulae for retrieving
semantically similar papers.
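
As an illustration of the "ten most similar papers" computation, here is a minimal plain-Python sketch. It is not the Gensim-based framework itself: it assumes a simple TF-IDF weighting with cosine similarity, and it assumes formulae are represented as extra tokens with a hypothetical `math:` prefix. The function names and the toy corpus are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(vectors, query, k=10):
    """Return the k (index, score) pairs most similar to document `query`."""
    scores = [(j, cosine(vectors[query], v))
              for j, v in enumerate(vectors) if j != query]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]


# Toy corpus: words plus hypothetical formula tokens prefixed with "math:".
papers = [
    ["banach", "space", "operator", "math:norm(x)"],
    ["hilbert", "space", "operator", "math:norm(x)", "math:inner(x,y)"],
    ["graph", "colouring", "chromatic", "math:chi(G)"],
    ["graph", "planar", "colouring", "math:chi(G)"],
]
vectors = tfidf_vectors(papers)
top = most_similar(vectors, query=0, k=2)
```

Treating formula tokens like ordinary terms is only one of several possible representations; the sketch makes it easy to swap the tokenization and compare the resulting rankings.
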
The quality of the computed similarity is measured by
comparison with the similarity matrix induced
by the Mathematics Subject Classification (MSC) codes
assigned to every paper.
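
One way such an MSC-based comparison could look is sketched below. How the similarity matrix is actually induced is not stated here, so the sketch assumes a Jaccard overlap of two-digit top-level MSC classes and a simple averaged agreement score. The function names, paper identifiers, and rankings are hypothetical.

```python
def msc_similarity(codes_a, codes_b):
    """Jaccard overlap of two-digit top-level MSC classes (an assumed choice)."""
    a = {code[:2] for code in codes_a}
    b = {code[:2] for code in codes_b}
    return len(a & b) / len(a | b) if a | b else 0.0


def mean_msc_agreement(ranked, msc):
    """Average MSC similarity between each paper and its retrieved neighbours."""
    total, count = 0.0, 0
    for paper, neighbours in ranked.items():
        for other in neighbours:
            total += msc_similarity(msc[paper], msc[other])
            count += 1
    return total / count if count else 0.0


# Hypothetical papers with their MSC codes.
msc = {
    "p1": ["46B20", "47A30"],
    "p2": ["46C05"],
    "p3": ["05C15"],
    "p4": ["05C10", "05C15"],
}
# Hypothetical output of a similarity method: paper -> retrieved neighbours.
ranked = {"p1": ["p2"], "p2": ["p1"], "p3": ["p4"], "p4": ["p3"]}
score = mean_msc_agreement(ranked, msc)
```

A higher score means the retrieved neighbours agree more closely with the classification, which gives a single number for comparing different formula representations.
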
In the talk we will report on a) the evaluation
of the similarities computed by several different methods,
b) the experience from 20 months of deployment
in EuDML and more than 5 years in DML-CZ,
c) the importance of representing
formulae even for paper similarity computations, and
d) setting up Gensim for
math-aware use in DML projects.
References
[1] P. Sojka, J. Rákosník, From Pixels and Minds to the Mathematical Knowledge in Digital
Library. In P. Sojka (ed.): Proceedings of DML 2008: Towards a Digital Mathematics
Library. Brno: Masaryk University, 2008, pp. 17-27, https://is.muni.cz/publication/762453?lang=en.
[2] J. Borbinha, T. Bouche, A. Nowiński, P. Sojka, Project EuDML – A First Year Demonstration.
In J.H. Davenport et al. (eds.): Proceedings of CICM 2011, Springer, LNAI
vol. 6824, 2011, pp. 281-284, doi:10.1007/978-3-642-22673-1_21.
[3] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora. In
Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta,
Malta: University of Malta, 2010, pp. 46-50, http://is.muni.cz/publication/884893/en.
[4] R. Řehůřek, P. Sojka, Automated Classification and Categorization of Mathematical
Knowledge. In S. Autexier et al. (eds.): Proceedings of CICM 2008, Springer, LNAI vol.
5144, 2008, pp. 543-557, doi:10.1007/978-3-540-85110-3_44.