Information


How to cite

This website is described in Witte et al. (2021). Because the website is built around the methods and materials developed in Witte and Köbler (2019), please consider citing that source as well.


Word metric calculator settings

Temporary spelling changes

When calculating orthographic transparency, a set of temporary spelling changes may be used. For example, a digit such as 3 may be temporarily replaced with a letter representation such as "tre". To use such temporary replacements, simply add the desired replacements in the appropriate text area on the calculation page. Enter one replacement per row, with the original spelling and the desired replacement separated by a tab character.
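
As a rough illustration of the replacement format, the Python sketch below parses tab-separated rows and applies them to a spelling. It is not the website's own code: the function names `parse_replacements` and `apply_replacements` are hypothetical, and applying longer originals first is an assumption made here, not documented behaviour.

```python
def parse_replacements(text):
    """Parse rows of 'original<TAB>replacement' into a list of pairs."""
    pairs = []
    for row in text.splitlines():
        if not row.strip():
            continue  # ignore empty rows
        original, separator, replacement = row.partition("\t")
        if not separator or not original:
            raise ValueError(f"Row is not a tab-separated pair: {row!r}")
        pairs.append((original, replacement))
    return pairs


def apply_replacements(spelling, pairs):
    """Apply each replacement to the spelling, one after the other."""
    # Assumption: longer originals go first, so "30" would be handled before "3".
    for original, replacement in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        spelling = spelling.replace(original, replacement)
    return spelling


rows = "3\ttre\n4\tfyra"  # two rows, each with a tab between original and replacement
pairs = parse_replacements(rows)
print(apply_replacements("3D", pairs))
```

Because the replacements are applied sequentially in this sketch, a replacement whose output contains another original would be replaced again; whether the website behaves the same way is not stated.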

Optional checking and correction of manually entered phonetic transcriptions

Manually entered phonetic transcriptions can be checked automatically by the website. The check runs before the word-metric calculations, and any transcription errors detected are displayed in an error-messages box on the results page.


The AFC-list

The AFC-list is a word metric database for the Swedish language, published in Witte and Köbler (2019) under a CC-BY 4.0 license. For a total of 816 404 Swedish words, the AFC-list contains word spellings, phonetic transcriptions, word frequencies, and a number of additional types of data arguably important for the perception of speech.


Column descriptions

The list below contains short descriptions of each column heading used on the SWM website. For further description of the word metrics, see Witte and Köbler (2019) and Witte et al. (2021).

  • OrthographicForm: The orthographic form / spelling of the word
  • PhoneticForm: The standard phonetic transcription, according to the convention described in Witte and Köbler (2019)
  • PhonotacticType: The phonotactic type
  • SyllableCount: The number of syllables in the phonetic transcription
  • PhoneCount: The number of phonetic segments in the phonetic transcription
  • ZipfValue: The Zipf-scale value
  • RawWordTypeFrequency: The total number of occurrences of the spelling in the internet blog corpora in Witte and Köbler (2019)
  • RawDocumentCount: The total number of different internet blogs in which the word occurs, based on the internet blog corpora in Witte and Köbler (2019)
  • PLD1WordCount: The number of phonetic neighbors
  • OLD1WordCount: The number of orthographic neighbors
  • PLDx_Average: The average edit distance to the x closest phonetic neighbors in the AFC-list (cf. Yarkoni, Balota, & Yap, 2008)
  • OLDx_Average: The average edit distance to the x closest orthographic neighbors in the AFC-list (cf. Yarkoni, Balota, & Yap, 2008)
  • PNDP: The Zipf-scale weighted phonetic neighborhood density probability
  • ONDP: The Zipf-scale weighted orthographic neighborhood density probability
  • GIL2P_OT_Average: The average grapheme-initial letter-to-pronunciation orthographic transparency
  • GIL2P_OT_Min: The minimum grapheme-initial letter-to-pronunciation orthographic transparency
  • PIP2G_OT_Average: The average pronunciation-initial phone-to-grapheme orthographic transparency
  • PIP2G_OT_Min: The minimum pronunciation-initial phone-to-grapheme orthographic transparency
  • G2P_OT_Average: The average grapheme-to-pronunciation orthographic transparency
  • SSPP_Average: The average normalized stress and syllable based phonotactic probability
  • SSPP_Min: The minimum normalized stress and syllable based phonotactic probability
  • PSP_Sum: The summed positional segment probability
  • PSBP_Sum: The summed position specific bi-phone probability
  • S_PSP_Average: The average standardized positional segment probability
  • S_PSBP_Average: The average standardized position specific bi-phone probability
  • OrthographicIsolationPoint: The (zero-based) index of the letter at which a particular word can be uniquely discriminated from all other words in the AFC-list
  • PhoneticIsolationPoint: The (zero-based) index of the phone at which a particular word can be uniquely discriminated from all other words in the AFC-list (cf. the COHORT model of speech perception; Marslen-Wilson & Welsh, 1978)
  • LetterCount: The number of letters in the orthographic form
  • GraphemeCount: The number of graphemes in the sonograph array
  • DiGraphCount: The number of two-letter graphemes in the sonograph array
  • TriGraphCount: The number of three-letter graphemes in the sonograph array
  • LongGraphemesCount: The number of graphemes longer than three letters in the sonograph array
  • SpecialCharacter: The existence of special characters in the orthographic form
  • UpperCase: The proportion of times the word begins with an upper case letter
  • MostCommonPoS: The most common word-class assignment in the internet blog corpora in Witte and Köbler (2019)
  • MostCommonLemma: The most common lemma assignment in the internet blog corpora in Witte and Köbler (2019)
  • ForeignWord: The foreign word marking
  • Abbreviation: The abbreviation marking
  • Acronym: The acronym marking
  • HomographCount: The number of homographs
  • HomophoneCount: The number of homophones
  • NumberOfSenses: The total number of senses of all lemmas as described in Witte and Köbler (2019)
  • ReducedTranscription: The reduced phonetic transcription, as defined in Witte and Köbler (2019)
  • TemporarySyllabification: The phonetic transcription re-syllabified by the syllabification tool used for calculating SSPP.
  • PLD1Transcriptions: The reduced phonetic transcriptions and raw word frequencies (delimited by colons) of phonetic neighbors (delimited by vertical lines), sorted according to word frequency. The first word is the look-up word.
  • OLD1Spellings: The orthographic forms and raw word frequencies (delimited by commas) of orthographic neighbors (delimited by vertical lines), sorted according to word frequency. The first word is the look-up word.
  • PLDx_Neighbors: The x closest phonetic neighbors in the AFC-list (cf. Yarkoni, Balota, & Yap, 2008)
  • OLDx_Neighbors: The x closest orthographic neighbors in the AFC-list (cf. Yarkoni, Balota, & Yap, 2008)
  • Sonographs: The sonographs (as defined in Witte and Köbler, 2019)
  • Homographs: The reduced phonetic transcriptions of homographs, as defined in Witte and Köbler (2019)
  • Homophones: All homophone spellings
  • AllPoS: The word-class assignments, and their relative distributions in the internet blog corpora in Witte and Köbler (2019)
  • AllLemmas: The lemma assignments, and their relative distributions in the internet blog corpora in Witte and Köbler (2019)
  • SSPP: The normalized stress and syllable based phonotactic probability for each phoneme combination
  • PSP: The positional segment probability for each phoneme
  • PSBP: The position specific bi-phone probability for each bi-phone
  • S_PSP: The standardized positional segment probability for each phoneme
  • S_PSBP: The standardized position specific bi-phone probability for each bi-phone
  • GIL2P_OT: The grapheme-initial letter-to-pronunciation orthographic transparency of each sonograph
  • PIP2G_OT: The pronunciation-initial phone-to-grapheme orthographic transparency of each sonograph
  • G2P_OT: The grapheme-to-pronunciation orthographic transparency of each sonograph
  • Tone: The pitch accent
  • MainStressSyllable: The primary stressed syllable (1-based index)
  • SecondaryStressSyllable: The secondary stressed syllable (1-based index)
  • PhoneCountZero: The number of phones when empty word-initial syllable onsets and word-final codas are counted as phones
  • PossiblePoSCount: The number of possible word classes
  • PossibleLemmaCount: The number of possible lemmas
  • ManualEvaluations: A field that may specify detailed comments concerning the word-metric calculations
  • ManualEvaluationsCount: The number of comments stored for each word in ManualEvaluations
  • IPA: A phonetic transcription without syllable boundaries
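
To make the PLDx_Average and OLDx_Average columns more concrete, here is a minimal Python sketch of the idea from Yarkoni, Balota, & Yap (2008): the mean Levenshtein (edit) distance from a word to its x closest neighbors. The function names, the toy lexicon, and the exclusion of the look-up word itself are assumptions made for illustration; the actual AFC-list calculations are described in Witte and Köbler (2019) and may differ in detail.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def mean_distance_to_closest(word, lexicon, x=20):
    """Mean edit distance from `word` to its x closest neighbors in `lexicon`."""
    # Assumption: the look-up word itself is not counted as its own neighbor.
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:x]
    if not closest:
        raise ValueError("The lexicon contains no other words.")
    return sum(closest) / len(closest)


toy_lexicon = ["hus", "mus", "bus", "hund", "katt"]  # illustrative only
print(mean_distance_to_closest("hus", toy_lexicon, x=3))
```

Since `levenshtein` works on any sequence, the phonetic variant could be sketched by passing lists of phones instead of strings of letters.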

Phonetic transcription convention

This website uses the AFC-list phonetic transcription convention described in Witte and Köbler (2019).

Basically, the phonetic transcriptions need to be in the IPA format and adhere to the following principles:

  • The transcription should be surrounded by square brackets, e.g. [ˈ uː ɖ].
  • Transcriptions should be phonetic rather than phonemic.
  • Phonetic length should only be used in stressed syllables.
  • Syllable boundary marks need to be added between syllables; however, exact syllabification is not vital.
  • Items within the phonetic transcriptions (except for phonetic length markings) should be separated from each other by a blank space.
  • Only phonetic characters displayed in the box below should be used.
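
Some of the formal principles above lend themselves to a simple automatic check. The Python sketch below tests only the square brackets and the blank-space separation of items. It is not the website's own checking routine: the function name `check_transcription_format` is hypothetical, treating "a blank space" as exactly one space is an assumption, and the permitted phonetic characters, syllable boundaries, and phonetic length are not validated.

```python
def check_transcription_format(transcription):
    """Return a list of formatting problems (empty if none were found)."""
    problems = []
    text = transcription.strip()

    # Principle: the transcription should be surrounded by square brackets.
    if not (text.startswith("[") and text.endswith("]")):
        problems.append("Transcription is not surrounded by square brackets.")
        return problems

    inner = text[1:-1]
    if not inner.strip():
        problems.append("Transcription is empty.")
        return problems

    # Assumption: items are separated by exactly one blank space.
    if "  " in inner:
        problems.append("Items are separated by more than one blank space.")

    return problems


print(check_transcription_format("[ˈ uː ɖ]"))   # sample transcription from the list above
print(check_transcription_format("ˈ uː ɖ"))     # missing brackets
```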

Sample words (spellings / transcriptions):

Some sample spelling and transcription combinations are presented in the box below.


References

  1. Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10(1), 29-63. doi:10.1016/0010-0285(78)90018-X
  2. Witte, E., & Köbler, S. (2019). Linguistic Materials and Metrics for the Creation of Well-Controlled Swedish Speech Perception Tests. Journal of Speech, Language, and Hearing Research, 62(7), 2280-2294. doi:10.1044/2019_JSLHR-S-18-0454
  3. Witte, E., Edlund, J., Jönsson, A., & Danielsson, H. (2021). Swedish Word Metrics: A Swe-Clarin resource for psycholinguistic research in the Swedish language. Paper presented at the CLARIN Annual Conference 2021.
  4. Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart's N: a new measure of orthographic similarity. Psychonomic Bulletin & Review, 15(5), 971-979. doi:10.3758/PBR.15.5.971