DAT121 lab worksheet 3: Data and objects (part 2) and regression basics

1. Reflection / relation to project topic

Brief reflection: How does the content of the three lectures covered (18th, 21st, and 23rd August) relate to your project topic? How could it be put to use? Is there anything concrete in this direction that you plan to research further, or that we should discuss together?

2. Knowledge graph construction challenge (S. Borgo et al.)

  1. Familiarize yourself with the JSON-LD format and with the TTL (Turtle) format. Install Protégé on your system.
  2. Create a knowledge graph (ABox) for the following information content: "A flower is red in the summer. As time passes, the colour changes. In autumn the flower is brown." (source: Borgo et al. scenarios, therein 3a). Write it up as a JSON-LD or a TTL file, or both; see the sketch after this list for one possible starting point. You can use any pre-existing or self-designed TBox, or just provide an ABox without any TBox at all.
  3. Visualize the most relevant part of your solution using E-R notation.
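
One possible starting point is sketched below in Python with rdflib (version 6 or newer, where the JSON-LD serializer is built in). The ex: namespace and all class and property names are illustrative choices of our own, not a prescribed TBox; time-indexed flower states are one way to let the colour change without contradicting the earlier assertion.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/flower#")   # illustrative namespace, not a given TBox

g = Graph()
g.bind("ex", EX)

# One node per time-indexed state of the flower, so that the colour can
# change over time without contradicting the earlier assertion.
g.add((EX.flower1, RDF.type, EX.Flower))
g.add((EX.state_summer, RDF.type, EX.FlowerState))
g.add((EX.state_summer, EX.ofFlower, EX.flower1))
g.add((EX.state_summer, EX.season, Literal("summer")))
g.add((EX.state_summer, EX.colour, Literal("red")))
g.add((EX.state_autumn, RDF.type, EX.FlowerState))
g.add((EX.state_autumn, EX.ofFlower, EX.flower1))
g.add((EX.state_autumn, EX.season, Literal("autumn")))
g.add((EX.state_autumn, EX.colour, Literal("brown")))

print(g.serialize(format="turtle"))    # TTL version
print(g.serialize(format="json-ld"))   # JSON-LD version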

3. Annotation of websites using schema.org

Schema.org is a semantic artefact supported by a community that includes Google, Microsoft, and Yahoo, and that organizes itself through the W3C. Explore schema.org from its online index. You could also load the ontology in TTL format into Protégé; however, schema.org should already be perfectly intelligible through its web-based documentation.

The HTML code of the module website (https://home.bawue.de/~horsch/teaching/dat121/index.html) contains a rudimentary schema.org-based JSON-LD annotation. You can find it between the tags:

<script type="application/ld+json">
   …
</script>

How would you propose to modify and/or extend the annotation? Use the Google Rich Results Test to make sure that your revised annotation is processed by Google correctly.
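
To inspect the existing annotation before editing it, you can pull it out of the page programmatically. The following is a minimal sketch, assuming the requests and beautifulsoup4 packages are installed:

import json
import requests
from bs4 import BeautifulSoup

url = "https://home.bawue.de/~horsch/teaching/dat121/index.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# The annotation sits inside a <script type="application/ld+json"> element.
tag = soup.find("script", type="application/ld+json")
annotation = json.loads(tag.string)
print(json.dumps(annotation, indent=2))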

4. From JSON-LD to TTL format

Take the annotation that you introduced/modified in problem 3 using JSON-LD and write up the same triples in TTL format. (Not all the triples from the JSON-LD annotation: only those that you newly introduced or modified.)
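
rdflib can serve as a cross-check for your hand-written TTL. A sketch, assuming your revised annotation has been saved under the placeholder name annotation.jsonld (rdflib 6 or newer parses JSON-LD out of the box):

from rdflib import Graph

g = Graph()
g.parse("annotation.jsonld", format="json-ld")  # placeholder file name
print(g.serialize(format="turtle"))             # the same triples in TTL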

5. Querying Wikidata

The Wikidata SPARQL endpoint is a good environment for training yourself in the practical use of SPARQL. The documentation contains a long list of query examples. The IRIs used by Wikidata are resolvable and are abbreviated by the following prefixes, which the endpoint declares by default:

@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

Accordingly, consider the following query from the list of examples:

# Birth places of German poets
#
SELECT ?subj ?subjLabel ?place ?placeLabel ?birthyear
WHERE {
   ?subj wdt:P106 wd:Q49757 .
   ?subj wdt:P19 ?place .
   ?place wdt:P17 wd:Q183 .
   ?subj wdt:P569 ?dob .

   BIND(YEAR(?dob) AS ?birthyear)
   FILTER(?birthyear > 800)
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} ORDER BY ?dob

For example, in the first triple, wdt:P106 expands to http://www.wikidata.org/prop/direct/P106, an object property labelled "occupation" and documented at https://www.wikidata.org/wiki/Property:P106. The whole triple has the meaning "?subj has the occupation poet," with wd:Q49757 standing for "poet."

Practice formulating your own queries; for example, try asking for a table of Nobel laureates who are or were at some point affiliated with UiO. Repeat for NTNU. How many results do you obtain in each case, and do you agree with the responses to your queries?
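
The endpoint can also be queried programmatically. Below is a minimal sketch using the SPARQLWrapper package; the wd:, wdt:, wikibase:, and bd: prefixes are predeclared at the Wikidata endpoint, so the query needs no prefix declarations. The agent string is a placeholder of our own, since Wikidata asks clients to identify themselves.

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
SELECT ?subj ?subjLabel ?place ?placeLabel ?birthyear
WHERE {
   ?subj wdt:P106 wd:Q49757 .
   ?subj wdt:P19 ?place .
   ?place wdt:P17 wd:Q183 .
   ?subj wdt:P569 ?dob .
   BIND(YEAR(?dob) AS ?birthyear)
   FILTER(?birthyear > 800)
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} ORDER BY ?dob
"""

# Wikidata asks clients to identify themselves; the agent string is a placeholder.
sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="DAT121-worksheet-example/0.1")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["birthyear"]["value"], row["subjLabel"]["value"],
          "-", row["placeLabel"]["value"])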

6. Correlation and significance

Apply regression using the statsmodels library to the data from the current Eliteserien league table. Based on these data, is there a significant correlation between the number of goals scored and the number of goals conceded? Is there a significant correlation between the number of draws and the goal difference? How about the number of draws and the square of the goal difference?
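
A sketch of the first of these regressions with statsmodels is given below. The two arrays are made-up placeholders, not the actual Eliteserien standings; substitute the real table before interpreting any p values.

import numpy as np
import statsmodels.api as sm

# Placeholder numbers only; replace with the current Eliteserien table.
goals_scored   = np.array([38, 35, 31, 29, 27, 25, 24, 22])
goals_conceded = np.array([18, 22, 25, 24, 28, 30, 29, 33])

X = sm.add_constant(goals_scored)        # regressors: intercept + goals scored
model = sm.OLS(goals_conceded, X).fit()  # ordinary least squares
print(model.summary())                   # the slope's p value indicates significance

The other questions follow the same pattern: regress the number of draws on the goal difference, and then on the square of the goal difference (square the column before passing it to add_constant).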

7. Model validation and testing

Consider the given data set consisting of 50 x-y pairs.

  1. Split the data set into training, validation, and test data (for example, at a ratio of 32:9:9). Attempt a linear regression of the type y = ax + b, using the training data set only. (A sketch of the whole workflow follows after this list.)
  2. Construct at least two other candidate models, one based on a quadratic equation y = ax² + bx + c and another one based on a hypothesis of your choice. For developing each of these candidate models, only the training data set should be used.
  3. Determine the root mean square deviation of each of your three models from the validation data. Select the model that performs best during validation as your final model.
  4. Test the final model, using the test data set only. Determine the margin of error from an appropriate measure such as two times the root mean square deviation between predicted and actual values of y for the test data.
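
Below is a sketch of the complete split-fit-validate-test workflow, using numpy only. The x and y arrays are placeholders standing in for the given 50 pairs, and the third, free-choice model is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the given 50 x-y pairs:
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

# Split 32:9:9 after shuffling the indices.
idx = rng.permutation(50)
train, val, test = idx[:32], idx[32:41], idx[41:]

def rmsd(p, xs, ys):
    # root mean square deviation of the polynomial model p
    return np.sqrt(np.mean((np.polyval(p, xs) - ys) ** 2))

# Fit the candidate models on the training data only.
models = {
    "linear": np.polyfit(x[train], y[train], 1),     # y = ax + b
    "quadratic": np.polyfit(x[train], y[train], 2),  # y = ax^2 + bx + c
}

# Validation: select the model with the smallest RMSD on the validation data.
best = min(models, key=lambda name: rmsd(models[name], x[val], y[val]))

# Test the selected model once, using the test data only.
margin = 2.0 * rmsd(models[best], x[test], y[test])
print("selected model:", best, "| margin of error:", round(margin, 3))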

8. Time series

As an example time series, import and consider the development of the logarithm of the NOK:EUR exchange rate over the past year.

  1. Construct the residual (i.e., remainder) of the time series with respect to a zeroth-order (constant average value), first-order, and second-order regression.
  2. For each of the three residual curves, compute the autocorrelation function.
  3. Use the autocorrelation function to obtain at least a rough estimate of the decorrelation time.
  4. Determine the interpolation error of the regression using Flyvbjerg-Petersen block averaging. (A sketch of the whole workflow follows after this list.)
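
A rough numpy-only sketch of steps 1 through 4 follows. The series s is a placeholder random walk rather than actual exchange-rate data, and the 1/e criterion is just one simple way to read a decorrelation time off the autocorrelation function.

import numpy as np

rng = np.random.default_rng(1)
# Placeholder series standing in for the log NOK:EUR rate over one year:
s = 0.01 * np.cumsum(rng.normal(size=250))
t = np.arange(len(s))

# Step 1: residuals with respect to zeroth-, first-, and second-order regression.
residuals = {deg: s - np.polyval(np.polyfit(t, s, deg), t) for deg in (0, 1, 2)}

def autocorr(r):
    # Step 2: normalized autocorrelation function of a residual series.
    r = r - r.mean()
    acf = np.correlate(r, r, mode="full")[len(r) - 1:]
    return acf / acf[0]

def blocking_errors(r):
    # Step 4: Flyvbjerg-Petersen blocking; average neighbouring pairs
    # repeatedly, and look for a plateau in the error estimate of the mean.
    errors = []
    while len(r) >= 4:
        errors.append(np.sqrt(np.var(r, ddof=1) / (len(r) - 1)))
        n = len(r) // 2
        r = 0.5 * (r[0:2*n:2] + r[1:2*n:2])
    return errors

for deg, r in residuals.items():
    acf = autocorr(r)
    tau = int(np.argmax(acf < 1.0 / np.e))  # Step 3: rough 1/e decorrelation time
    print("order", deg, "| tau ~", tau, "steps | blocking errors:",
          np.round(blocking_errors(r), 6))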

9. Discussion of glossary terms

The following glossary terms have been proposed for "Data and objects" (second lecture, 18th August): Knowledge base, knowledge graph (also: ABox), ontology (also: TBox), resource, triple. For "Regression basics," we had: Block averaging, decorrelation time (also: autocorrelation time), hypothesis, influence diagram, p value, regression analysis, residual quantity, supervised learning, uncertainty, validation and testing.

(submit through Canvas by end of 24th August 2023)
