It's a Tree... It's a Graph... It's a Traph! : Designing an on-file multi-level graph index for the Hyphe web crawler
2018-02-03 / 2018-02-04
Hyphe, a web crawler for social scientists developed by the SciencesPo médialab, introduced the novel concept of web entities to provide a flexible and evolutive way of grouping web pages in situations where the notion of website is not relevant enough (either too large, for instance with Twitter accounts, newspaper articles or Wikipedia pages, or too constrained to group together multiple domains or TLDs...). This comes with technical challenges since indexing a graph of linked web entities as a dynamic layer based on a large number of URLs is not as straightforward as it may seem. We aim at providing the graph community with some feedback about the design of an on-file index - part Graph, part Trie - named the "Traph", to solve this peculiar use-case. Additionally we propose to retrace the path we followed, from an old Lucene index, to our experiments with Neo4j, and lastly to our conclusion that we needed to develop our own data structure in order to be able to scale up.