 
 

It's a Tree... It's a Graph... It's a Traph!

 

Record

Type:   Unpublished communication
 
Title:   It's a Tree... It's a Graph... It's a Traph!: Designing an on-file multi-level graph index for the Hyphe web crawler
 
Author(s):   Plique, Guillaume - Médialab (Author)
Jacomy, Mathieu (1980-...) - Médialab (Author)
Ooghe, Benjamin - Médialab (Author)
Girard, Paul - Médialab (Author)
 
Abstract:   Hyphe, a web crawler for social scientists developed by the SciencesPo médialab, introduced the novel concept of web entities to provide a flexible and adaptable way of grouping web pages in situations where the notion of a website is not relevant enough (either too large, for instance with Twitter accounts, newspaper articles or Wikipedia pages, or too constrained to group together multiple domains or TLDs). This comes with technical challenges, since indexing a graph of linked web entities as a dynamic layer built on top of a large number of URLs is not as straightforward as it may seem. We aim to give the graph community some feedback about the design of an on-file index (part graph, part trie) named the "Traph", built to solve this peculiar use case. Additionally, we propose to retrace the path we followed, from an old Lucene index, to our experiments with Neo4j, and lastly to our conclusion that we needed to develop our own data structure in order to scale up.
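To make the "part graph, part trie" idea concrete, here is a minimal in-memory sketch in Python. It is not Hyphe's actual implementation, which is an on-file index. The class names (`TrieNode`, `ToyTraph`), the method names, and the representation of a URL as a tuple of segments ordered from most general to most specific are all assumptions made for illustration. Pages and web entity prefixes live in one trie, page-to-page links form a graph, and the web entity graph is derived from both as a dynamic layer.

```python
from collections import defaultdict


class TrieNode:
    """One URL segment in the trie (hypothetical structure)."""

    def __init__(self):
        self.children = {}    # segment -> TrieNode
        self.entity = None    # web entity whose prefix ends on this node
        self.is_page = False  # True if a crawled page ends on this node


class ToyTraph:
    """In-memory toy: a trie of URL segments plus a page-level link graph."""

    def __init__(self):
        self.root = TrieNode()
        self.page_links = defaultdict(set)  # source page -> target pages

    def _walk_or_create(self, segments):
        node = self.root
        for segment in segments:
            node = node.children.setdefault(segment, TrieNode())
        return node

    def define_entity(self, name, prefix):
        """Declare that every page under `prefix` belongs to entity `name`."""
        self._walk_or_create(prefix).entity = name

    def add_link(self, source, target):
        """Record a hyperlink between two pages, indexing both in the trie."""
        self._walk_or_create(source).is_page = True
        self._walk_or_create(target).is_page = True
        self.page_links[tuple(source)].add(tuple(target))

    def entity_of(self, segments):
        """Return the entity with the longest prefix matching the page."""
        node, found = self.root, self.root.entity
        for segment in segments:
            node = node.children.get(segment)
            if node is None:
                break
            if node.entity is not None:
                found = node.entity  # a deeper prefix overrides a shallower one
        return found

    def entity_graph(self):
        """Aggregate page links into links between distinct web entities."""
        graph = defaultdict(set)
        for source, targets in self.page_links.items():
            src = self.entity_of(source)
            for target in targets:
                dst = self.entity_of(target)
                if src is not None and dst is not None and src != dst:
                    graph[src].add(dst)
        return dict(graph)
```

In this sketch, defining a new, more specific prefix (for example a single account under a larger platform) changes the result of `entity_graph()` without touching the stored pages or links, which is the sense in which the entity layer is dynamic.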
 
 
