Q&A: Paths-in-graphs topic

Use case

What could the graph represent in a real life example?

It could be a knowledge graph such as the one from Wikidata.

What is the problem related to in a real life example?

It could be an instance of processing a query over a knowledge graph, in cases where two sub-queries come together through their two endpoints. For example, "What ministers had children who became ministers of the same country?", "What inventors were killed by their own invention?", or "What products are of the same colour as their packaging?"

Should the paths/sequences have the same starting node? Or should they be able to start at different nodes?

The recommendation is to look for paths labelled (p, q) that have a common start and end node.

Is it the idea that the user provides the start node, or is part of the task to find the location of the sequence in the graph ourselves?

The idea is that the user provides g, p, and q, and the algorithm then searches for an instance of the (p, q) pattern in the graph g.

In what scenario is it useful to check whether p and q lead to the same node?

The problem is defined such that p and q both begin at the same node and also both end at the same node. That happens when a query relates two objects in two different ways, so that it has the form "do we know any a and b that are related to each other in one way and also in another way?"

Optimization

Is it expected for us to optimize our code?

Efficiency of the implementation is relevant under the aspect "algorithms/performance." But how can we tell that an algorithm has been implemented efficiently? It can help to describe how it was optimized and how that leads to an excellent scaling behaviour with low runtimes.

Could it make sense to analyse the graph and the paths between nodes in depth, indpendent of the concrete query?

It could make sense, and would be a typical use case, to apply many different queries (p, q) to the same graph g. Therefore, analysing the graph such that new queries can be evaluated more efficiently would be a valid and relevant optimization target.

Is it an option to look at p and q as a means to make the code quicker?

Applying the same query (p, q) to many different graphs g is also a relevant realisitic use case scenario. So that might in principle make sense, assuming that you are really addressing this as a use case.

Parallelization

Is it expected for us to parallelize our code?

That is relevant under the aspect "concurrency."

Up to 50% of the grade for this aspect are accessible based on a theoretical discussion, without submitting an actual parallelized code, but if you aim at a good grade it does make sense to include one.

What options are there for parallellization of the code?

The main option would be message-passing based parallelization using MPI or ROS. Different processes can be responsible for different parts of the graph and/or different parts of the query.

What if multiple threads/ranks try to access the same object in memory at the same time?

Multiple ranks in MPI do not have any shared memory. But it can happen for threads in OpenMP, and where this would create a race condition, compiler directives for synchronization such as #pragma omp critical should be used to avoid that.

Technical recommendations

Should we consider scenarios where the nodes and/or edges have unique labels?

The recommendation is that the nodes should have unique labels. In a knowledge graph, they represent objects. The edges should not have unique labels or else the problem would become trivial and uninteresting. In a knowledge graph, edges represent relations between objects.

Where exactly is the boundary between sparse graphs and dense graphs?

"The distinction between sparse and dense graphs is rather vague, and depends on the context" [Wikipedia as of 2022]. It is a question of scaling: How does the number of edges e depend on the number of nodes n? If e is proportional to n2, we are dealing with dense graphs. If e is proportional to n, with sparse graphs. In between? There the terminology is not very clear.

What are p and q: Sequences of edge labels?

The recommendation is that p and q should be sequences of edge labels.

As the edge labels are not unique, there will be cases where there are many instances of the pattern within the same graph. Should the algorithm return a list of all these instances?

The formulation given for the topic was just as a yes/no question: Does the pattern exist in the graph or not? That would allow the algorithm to terminate after detecting the first instance of (p, q) in the graph. But it would really be closer to the technical use case of querying a graph database to look for all the instances. So it would be just as sensible to give the topic such an interpretation.

What happens if the sequences in p and q cross each other?

Maybe it is not a problem; at least there is no obvious reason why that should not be allowed.

In the code example from the lecture, how did you parse the graph data and ingest them from the file containing the data? Can you explain that part of the code?

That code follows quite closely the example from p. 128, Section 10.5 ("I/O of User-Defined Types") in the Stroustrup (2018) book.


Back to "paths in graphs" topic