Word index for AutoDocs/Includes

amifrog · Post by **amifrog** » 18 January 2020 18:32

Hello everyone.

I'm currently struggling a bit with building the index for the AutoDocs. It is supposed to power a useful search function later on, so it needs to work reasonably well.

I have a few questions about this.
1.
In the screenshot below you can see a few 'special' characters that of course don't really belong to the word but are code-specific. Should I filter those out as well?
What about things like Window->Flags? Should I strip the -> there too? Logically I would have to split all of that into real words, but is that what you want in an AutoDoc browser? And what about the Includes? Those are more like source code, so should it stay in there?
2.
At the moment I don't store the index yet; I rebuild it on the fly for each file. For dos.doc or intuition.doc that takes about half a second including sorting, and that is without any indexing, just finding the words.
(The String.split() method I'm using unfortunately doesn't accept an array as its argument, so I have to run each separator character over the string buffer individually...)
Normally the database should of course cover ALL files, with links to them and so on.
What's the best way to save/load something like that? But note: it has to be easy to implement!
3.
Are there any names, identifiers, function names etc. with only 2 letters?
Likewise, any with a hyphen (-)?
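
One way around splitting once per separator character is to describe what a word looks like and collect the matches in a single pass. The following is only a minimal sketch in Python, not the tool's actual code; the pattern, the function name `find_words` and the `min_length` parameter are assumptions made for illustration.

```python
import re

# Assumption: a "word" starts with a letter or underscore and continues with
# letters, digits or underscores. Everything else (->, parentheses, commas,
# hyphens, ...) acts as a separator, so no per-character split is needed.
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def find_words(text, min_length=1):
    """Return all words in text that have at least min_length characters."""
    return [w for w in WORD_RE.findall(text) if len(w) >= min_length]

line = "if (Window->Flags & WFLG_ACTIVATE) OpenWindow(nw);"
print(find_words(line, min_length=3))
# prints ['Window', 'Flags', 'WFLG_ACTIVATE', 'OpenWindow']
```

With this pattern Window->Flags yields the two words Window and Flags, and a hyphen always separates words. Whether that is the desired behaviour for the Includes is exactly the open question above.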

@MODs
Feel free to move the images from the General thread over here if necessary.

amifrog · Post by **amifrog** » 18 January 2020 20:10

Just to clarify point 2:
Of course I would move the database generation out into a separate step, but loading should be easy; for example, I'd rather not link in SQL or the like if I can avoid it.

Accordingly, I could make individual options selectable when adding directories (for the scan) during index creation.
'Universal' would of course be simpler.
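
For a store that loads easily without a database library, one option is a plain JSON file that maps each word to the places where it occurs. This is a minimal sketch in Python under that assumption; the layout (word to list of [file, line] pairs), the function names and the sample entries are hypothetical and not taken from the tool.

```python
import json
import os
import tempfile

def save_index(index, path):
    """Write the index to path as JSON; index maps word -> list of [file, line]."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, sort_keys=True, indent=1)

def load_index(path):
    """Read an index written by save_index back into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical sample data: the line numbers are made up for illustration.
index = {
    "OpenWindow": [["intuition.doc", 120]],
    "Lock": [["dos.doc", 42], ["dos.doc", 310]],
}
path = os.path.join(tempfile.mkdtemp(), "autodocs_index.json")
save_index(index, path)
print(load_index(path) == index)  # prints True
```

The trade-off is that the whole file is read into memory on load. For a few thousand distinct words per AutoDoc that may well be acceptable, but it is a design choice, not a measured result.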

Post by **ZeroG** » 18 January 2020 20:17

1.
In the screenshot below you can see a few 'special' characters that of course don't really belong to the word but are code-specific. Should I filter those out as well?
What about things like Window->Flags? Should I strip the -> there too? Logically I would have to split all of that into real words, but is that what you want in an AutoDoc browser? And what about the Includes? Those are more like source code, so should it stay in there?

You've already got the right instinct there:
Things like Window->Flags are (C) source code in the AutoDocs as well. The Includes are source code.

I think I would build the word index only from parts of the AutoDocs (or at least skip the EXAMPLE sections).
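
Skipping the EXAMPLE sections could look roughly like the sketch below. It is written in Python and rests on an assumption about the file layout: a section heading is taken to be a line that consists only of upper-case words (NAME, SYNOPSIS, SEE ALSO, ...), optionally indented. The function name `lines_to_index` and the `skip` parameter are made up for illustration.

```python
import re

# Assumption: a section heading is a line made up solely of upper-case
# words, optionally indented. Real AutoDocs may need a stricter rule.
HEADING_RE = re.compile(r"^\s*([A-Z][A-Z ]*[A-Z])\s*$")

def lines_to_index(text, skip=("EXAMPLE", "EXAMPLES")):
    """Yield the lines of text that are not inside a skipped section."""
    skipping = False
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading starts; decide whether its body is skipped.
            skipping = match.group(1) in skip
            continue  # design choice: headings themselves are not indexed
        if not skipping:
            yield line

doc = "   NAME\n\tOpenWindow\n   EXAMPLE\n\twin->Flags |= 1;\n   SEE ALSO\n\tCloseWindow()\n"
print(list(lines_to_index(doc)))
# prints ['\tOpenWindow', '\tCloseWindow()']
```

One known weakness of this heuristic: an all-upper-case line inside an example (a lone constant, say) would be mistaken for a heading and end the skipped region early.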

Are there any names, identifiers, function names etc. with only 2 letters?
Likewise, any with a hyphen (-)?

I can't think of any right now.

amifrog · Post by **amifrog** » 19 January 2020 03:16

Thanks for the input.
Of course this won't be a "help" index of the kind you're used to.
There will be manual entries, fixed links and so on.
But:
Not every word will be visible or linked. I'm not stupid, after all.

Instead, the index is built by first scanning a general language file (a book, for example) and then excluding the words found there from the scan of the AutoDocs.
So there will be no "the/is/are..." in the lexicon.
The goal, of course, is to do as little as possible by hand. I don't want to edit filter lists or the like.
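
The exclusion idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's code: the word pattern, the case-insensitive comparison and the names `build_exclude_set` and `filter_words` are all made up here.

```python
import re

WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_exclude_set(reference_text):
    """Collect every word of a general-language text, lower-cased."""
    return {w.lower() for w in WORD_RE.findall(reference_text)}

def filter_words(doc_text, exclude):
    """Return the words of doc_text that do not occur in the exclude set."""
    return [w for w in WORD_RE.findall(doc_text) if w.lower() not in exclude]

exclude = build_exclude_set("The spirit is that which moves")
print(filter_words("the Lock is released", exclude))
# prints ['Lock', 'released']
```

One caveat of the approach itself: words that occur in ordinary prose but are also technical terms, such as "window" or "lock", would be dropped whenever the reference book happens to contain them.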

amifrog · Post by **amifrog** » 8 February 2020 22:49

This is what the test looks like.

For now I've built in statistics to see what actually occurs most often. The output can surely be fine-tuned in tiers later, maybe a top-10 list or something.
'Find' stays in, of course; I still have to add links to the lines.

sdkb_findall_exec-synopsis.jpg

Index of exec.doc, with word count; the top hit is 'library':

sdkb_index_exec-library.png

The final SDKBrowser won't do the indexing itself, of course, because the DB is created in the DB-Maker.

sdkb_DBmaker_GUI_test0.png

For the exclude words I tentatively used two old books that should be 100% free of computer terminology:
'Old Testament from King James Bible' and
'War and Peace'.
The most frequent word in that Bible? "That" !!

sdkb_index_bible-that.jpg

Check: War & Peace: "That" !!

sdkb_index_warpeace-that.jpg

As a test, here is the 'FindAll' function again, this time for the Bible:

sdkb_find_bible-spirit.jpg

At the moment 'Find' still matches parts of words rather than whole words only. That will of course become optional later.
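
The difference between matching word parts and matching whole words can be sketched as follows. This is a hedged Python illustration, not the browser's 'Find' implementation; the function name `find_all` and its `whole_words` switch are hypothetical.

```python
import re

def find_all(text, term, whole_words=False):
    """Return (line_number, line) pairs for every line that contains term."""
    pattern = re.escape(term)
    if whole_words:
        # \b matches only at a word boundary, so "spirit" no longer
        # matches inside "spiritual" or "spirits".
        pattern = r"\b" + pattern + r"\b"
    regex = re.compile(pattern, re.IGNORECASE)
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)]

text = "the spirit of God\nspiritual gifts\nin high spirits"
print([n for n, _ in find_all(text, "spirit")])                    # prints [1, 2, 3]
print([n for n, _ in find_all(text, "spirit", whole_words=True)])  # prints [1]
```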

amifrog wrote: ↑19 January 2020 03:16 Instead, the index is built by first scanning a general language file (a book, for example) and then excluding the words found there from the scan of the AutoDocs.
So there will be no "the/is/are..." in the lexicon.
The goal, of course, is to do as little as possible by hand. I don't want to edit filter lists or the like.

amifrog · Post by **amifrog** » 29 February 2020 23:07

amifrog wrote: ↑18 January 2020 18:32 2.
At the moment I don't store the index yet; I rebuild it on the fly for each file. For dos.doc or intuition.doc that takes about half a second including sorting, and that is without any indexing, just finding the words.
...

For good reason.
As everyone knows, the developer usually has a faster computer than the average user.
What feels almost instantly responsive on your own machine may be so slow for the user that it looks like a crash.

My indexing does not run in its own thread, so any delay is immediately visible.
As you can see in the debug output below, I write out milestones whenever the index has grown by a certain amount. Only then do I 'poll' the system again so that the whole thing stays responsive. Well, the time window in which the app remains responsive to the OS is about 2-3 seconds.
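
A variation on the milestone idea is to poll after a fixed amount of input rather than after a fixed growth of the index. The sketch below is Python and purely illustrative: `poll_system` is a hypothetical stand-in for whatever event-polling call the GUI toolkit provides, and `index_in_chunks` and `chunk_size` are names made up here.

```python
from collections import Counter

def index_in_chunks(words, chunk_size=1000):
    """Count words, yielding (distinct_words, counts) after every chunk."""
    counts = Counter()
    for start in range(0, len(words), chunk_size):
        counts.update(words[start:start + chunk_size])
        yield len(counts), counts  # the caller can poll the GUI here

def poll_system():
    """Hypothetical stand-in for the toolkit's event-polling call."""
    pass

words = ["open", "window", "open", "screen", "close", "window"]
for distinct, counts in index_in_chunks(words, chunk_size=2):
    poll_system()
    print("Index is >", distinct)
# prints "Index is > 2", then "Index is > 3", then "Index is > 4"
```

The point of chunking by input is that the gap between two polls depends only on the chunk size. With milestones tied to index growth, the gaps lengthen as new words become rarer towards the end of a file.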

That's why I built the tool for Win32 (e.g. Windows 7) and timed the indexing.
On an old laptop. Very old: Athlon X2 DualCore QL-65 2.10 GHz (Win7).
To cut to the chase: real time isn't possible on such a system. The app "hangs" for several minutes at a time. I can change that, of course, but that's not the point here.
(The index generation time varies, of course, depending on the minimum word length, which I set to 3 here.)
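
The debug output lists a 'Sorting buffer' step before the indexing. If the word buffer really is sorted at that point, equal words sit next to each other and can be counted in a single pass. The following Python sketch shows that technique under this assumption; it is not the tool's algorithm, and `count_sorted` and `min_length` are hypothetical names.

```python
from itertools import groupby

def count_sorted(words, min_length=3):
    """Count each distinct word by walking runs of equal neighbours."""
    kept = sorted(w for w in words if len(w) >= min_length)
    index = {}
    for word, run in groupby(kept):
        # After sorting, all occurrences of a word form one run.
        index[word] = sum(1 for _ in run)
    return index

index = count_sorted(["the", "is", "the", "Lock", "the"])
print(index)                          # prints {'Lock': 1, 'the': 3}
print(max(index, key=index.get))      # prints the
```

Raising `min_length` shrinks both the sorted buffer and the resulting index, which is one reason the timing depends on that setting.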

Code:

>>E:\base\Documentation\AutoDocs/intuition.doc
>>WSize!
>>WSize!
Wording buffer ...
done!
Sorting buffer ...
done.
0
!Indexing: MileStone Timing is 1/10 secs.
Index is >    100. time: 0
Index is >    300. time: 1
Index is >    500. time: 9
Index is >    800. time: 31
Index is >   1000. time: 106
Index is >   1100. time: 148
Index is >   1300. time: 166
Index is >   1500. time: 208
Index is >   1800. time: 337
Index is >   2000. time: 537
Index is >   2500. time: 762
Index is >   3000. time: 1505
Index is >   3500. time: 2690
Index is >   4000. time: 4441
Index is >   4500. time: 6476
Index is >   5000. time: 8831
Index is >   5500. time: 11150
indexing elapsed time: 1680 secs.
Elemente >0 in array: 5673 von 75874  ( 41984 doublettes )
Max repeats : the > [ 3466 ] value was : 3465
>>E:\base\Documentation\AutoDocs/dos.doc
Wording buffer ...
done!
Sorting buffer ...
done.
0
!Indexing: MileStone Timing is 1/10 secs.
Index is >    100. time: 0
Index is >    300. time: 2
Index is >    500. time: 12
Index is >    800. time: 28
Index is >   1000. time: 132
Index is >   1100. time: 178
Index is >   1300. time: 232
Index is >   1500. time: 286
Index is >   1800. time: 401
Index is >   2000. time: 608
Index is >   2500. time: 805
Index is >   3000. time: 1550
Index is >   3500. time: 2739
Index is >   4000. time: 4408
Index is >   4500. time: 7800
Index is >   5000. time: 11470
Index is >   5500. time: 15599
Index is >   6000. time: 19796
indexing elapsed time: 2375 secs.
Elemente >0 in array: 6416 von 130884  ( 64759 doublettes )
Max repeats : the > [ 4251 ] value was : 4250
>>WSize!

Process terminated

Here are the results from my (also no longer youthful) MBP: Core i7 quad 2.6 GHz.

Code:

>>/Volumes/SDK_53_30/base/Documentation/AutoDocs/intuition.doc
Wording buffer ...
done!
Sorting buffer ...
done.
0
!Indexing: MileStone Timing is 1/10 secs.
Index is >    100. time: 0
Index is >    300. time: 0
Index is >    500. time: 0
Index is >    800. time: 0
Index is >   1000. time: 1
Index is >   1100. time: 1
Index is >   1300. time: 1
Index is >   1500. time: 1
Index is >   1800. time: 2
Index is >   2000. time: 2
Index is >   2500. time: 3
Index is >   3000. time: 4
Index is >   3500. time: 5
Index is >   4000. time: 6
Index is >   4500. time: 7
Index is >   5000. time: 8
Index is >   5500. time: 9
indexing elapsed time: 1 secs.
Elemente >0 in array: 5673 von 75874  ( 41984 doublettes )
Max repeats : the > [ 3466 ] value was : 3465
>>/Volumes/SDK_53_30/base/Documentation/AutoDocs/dos.doc
Wording buffer ...
done!
Sorting buffer ...
done.
0
!Indexing: MileStone Timing is 1/10 secs.
Index is >    100. time: 0
Index is >    300. time: 0
Index is >    500. time: 0
Index is >    800. time: 0
Index is >   1000. time: 1
Index is >   1100. time: 1
Index is >   1300. time: 2
Index is >   1500. time: 2
Index is >   1800. time: 3
Index is >   2000. time: 3
Index is >   2500. time: 4
Index is >   3000. time: 5
Index is >   3500. time: 5
Index is >   4000. time: 6
Index is >   4500. time: 6
Index is >   5000. time: 7
Index is >   5500. time: 7
Index is >   6000. time: 8
indexing elapsed time: 0 secs.
Elemente >0 in array: 6416 von 130884  ( 64759 doublettes )
Max repeats : the > [ 4251 ] value was : 4250
>>/Volumes/SDK_53_30/ASCII-Files/warpeace.txt
>>WSize!
Wording buffer ...
done!
Sorting buffer ...
done.
0
!Indexing: MileStone Timing is 1/10 secs.
Index is >    100. time: 0
Index is >    300. time: 0
Index is >    500. time: 0
Index is >    800. time: 1
Index is >   1000. time: 2
Index is >   1100. time: 2
Index is >   1300. time: 3
Index is >   1500. time: 3
Index is >   1800. time: 4
Index is >   2000. time: 5
Index is >   2500. time: 6
Index is >   3000. time: 7
Index is >   3500. time: 8
Index is >   4000. time: 9
Index is >   4500. time: 10
Index is >   5000. time: 11
Index is >   5500. time: 12
Index is >   6000. time: 13
Index is >   6500. time: 14
Index is >   7000. time: 15
Index is >   7500. time: 16
Index is >   8000. time: 17
Index is >   8500. time: 18
Index is >   9000. time: 19
Index is >   9500. time: 20
Index is >  10000. time: 21
Index is >  11000. time: 22
Index is >  11500. time: 23
indexing elapsed time: 2 secs.
Elemente >0 in array: 18981 von 1186680  ( 444814 doublettes )
Max repeats : the > [ 31785 ] value was : 31784
gui setup stored.

Process complete

As you can see, the Core i7 MBP handles even "War and Peace" or the Bible in a second or two, whereas on the old Win7 laptop with the Athlon even dos.doc takes many minutes!
I'm currently letting the Athlon index War & Peace; let's see whether it finishes today...

amifrog · Post by **amifrog** » 29 February 2020 23:29

I will of course also test this on a system that has both OSes installed.
If the difference is just as drastic there, I'll put it down to the compiler.
