// You can get syntax highlighting for this file with the following vscode
// extension: https://marketplace.visualstudio.com/items?itemName=Stract.Optics

// Optics is a domain-specific language that lets you write small programs to
// influence the search results that get returned to you from Stract or other
// search engines that support Optics. The goal of Optics is to give you very
// fine-grained control over your own search experience. Let's see how this is
// done.

// As you might have guessed, comments can either be line comments with "//" or
// block comments "/* ... */". An optic consists of a sequence of rules, where
// each rule defines how a particular search result should be altered, given
// that it matches the specific rule.
Rule {
    Matches {
        Title("Top * of 2022")
    },
    Action(Downrank(3))
};

Rule {
    Matches {
        Url("reddit.com/r/*/comments")
    },
    Matches {
        Url("lemmy.world/post/*/comments")
    },
    Action(Boost(3))
};

// Let's unwrap what happens in the first two rules. The first rule matches all
// search results where some pattern occurs in the title of the search result.
// The pattern "Top * of 2022" indicates that a matching search result must
// contain the term "top" (case-insensitive) followed by any number of
// wildcard terms, which is then followed by the terms "of" and "2022" in
// sequence. This will e.g. match the titles "Top movies of 2022" and "See the
// top tourist attractions of 2022 in Paris". Note that the wildcard can match
// more than one term and that the pattern does not have to be at the beginning
// or end of the title. You can use the special token "|" in a pattern to
// indicate the beginning or end of the title, so the pattern "|Top * of 2022"
// would only match "Top movies of 2022" but not "See the top tourist
// attractions of 2022 in Paris". Patterns also only match entire terms, so the
// pattern "an*how" would not match the term "anyhow"; it would instead match
// "an", followed by any number of terms, followed by "how".

// The second rule matches search results where the pattern matches the url of
// the result. When there are several `Matches` blocks in a rule, the rule is
// applied if any of the blocks match.

// Both rules specify an action that will be applied to the matching results:
// the first rule downranks these types of listicles, whereas the second rule
// boosts discussion threads from reddit and lemmy. The action can either be
// `Action(Boost(<float>))`, `Action(Downrank(<float>))` or `Action(Discard)`,
// where the latter completely discards the result. If no action is specified
// in a rule, it is equivalent to `Action(Boost(0))`, which doesn't change the
// score of the result.
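
// For illustration, the anchored pattern from above could be used in a rule of
// its own. The downrank value here is chosen arbitrarily; the rule only matches
// results whose title starts with the pattern:
Rule {
    Matches {
        Title("|Top * of 2022")
    },
    Action(Downrank(5))
};
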
// You can specify any number of patterns in a `Matches` block. A result will
// only match if it matches all the patterns in the block. As an example, the
// following rule will only discard listicles from "esquire.com"; other results
// from the same domain will not be discarded.
Rule {
    Matches {
        Title("Top * of 2022"),
        Domain("esquire.com")
    },
    Action(Discard)
};

// Currently we support the following match-locations in the `Matches` block:
// - Site
// - Url
// - Domain
// - Title
// - Description
// - Content
// - Schema
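
// For illustration, several of these locations can also be combined in one
// rule. The site and term below are just placeholders; such a rule would boost
// results from that site whose content mentions the term:
Rule {
    Matches {
        Site("news.ycombinator.com"),
        Content("rust")
    },
    Action(Boost(2))
};
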
// The most special of these is probably the `Schema` location. This allows you
// to match results that contain specific https://schema.org entities. This is
// also the only match-location where you cannot use the special pattern
// characters "*" and "|"; `Schema` only supports simple patterns. The following
// rule boosts all pages that contain the
// https://schema.org/DiscussionForumPosting entity.
Rule {
    Matches {
        Schema("DiscussionForumPosting")
    },
    Action(Boost(5))
};

// By default, any search result that does not match the optic simply does not
// have any special action applied to it, and the search result's score is
// unaltered. This behaviour can be changed by specifying `DiscardNonMatching`:

DiscardNonMatching;

// Now all search results that do not match any of the specified rules will be
// discarded.

// The `Rule` blocks we have looked at so far only alter search results that
// match the specific rules and leave other results intact. To explain the
// next part, it will be helpful to know a little bit about how most web search
// engines rank their results.

// Back when Google launched, it completely disrupted the existing search
// engines by having a lot of technical advantages. One of the most
// groundbreaking things they did was the introduction of their core ranking
// algorithm called PageRank. The idea is that websites that have links from
// other trustworthy websites must themselves be trustworthy. The idea of
// analyzing website links and using this for ranking gave Google way better
// search results than their competitors. However, websites quickly realized
// this and started actively building links to their own websites, which skews
// the PageRank metric. In essence, websites with a high PageRank are no longer
// guaranteed to be trustworthy but might just be the best websites at link
// building.

// At Stract we use a similar link-based metric called Harmonic Centrality. It
// is calculated by taking the sum of the inverse distances from all other
// websites to a particular website. Imagine site A links to B, which then links
// to C. This gives the following normalized centralities:
//
//   A ---> B ---> C
//   0     0.33   0.5
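//
// For example, C is reached from B at distance 1 and from A at distance 2, so
// its raw harmonic centrality is 1/1 + 1/2 = 1.5; B is only reached from A,
// giving 1; and nothing links to A, giving 0. The normalized values above then
// follow from dividing by the number of sites in this small example (the exact
// normalization used in practice may differ).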

// While harmonic centrality is believed to be more resistant to link building
// than PageRank, it is still susceptible to it. PageRank, harmonic centrality
// and other link-based metrics also suffer from the fact that they are not
// personalized. They create a single global score for each site in the
// webgraph, even though different people might find different sites more
// trustworthy than others.

// Let's now get back from our detour and see what this means for our optics.
// In optics you have the ability to like and dislike sites. We then use these
// likes and dislikes to personalize the ranking for you. This is the same thing
// that happens whenever you like a search result directly in the Stract UI.
Like(Site("news.ycombinator.com"));
Dislike(Site("w3schools.com"));

// We then boost sites that are linked from the same sites as the ones you like,
// and downrank sites that are linked from the same sites as the ones you
// dislike. The idea is that if you like site A in the following graph, you will
// probably also like site B since they both have links from site C.
//
//     C
//    / \
//   v   v
//   A   B
//
// One thing to note here is that only simple patterns can be used inside
// `Like(Site("..."))` and `Dislike(Site("..."))`. The special tokens "*" and
// "|" will not work here.

// Last but not least, when you have developed your optic it can be installed in
// Stract by uploading the optic to GitHub and copying the "raw" url into
// https://stract.com/settings/optics. E.g. the url for this optic is
// "https://raw.githubusercontent.com/StractOrg/sample-optics/main/quickstart.optic".
// Alternatively, you can host it anywhere that returns a simple plain-text HTTP
// response and is publicly available.