The Sketch Engine
The Sketch Engine is for anyone wanting to research how words behave. It is a Corpus Query System incorporating word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour.
Users and Uses
- Lexicography at
  - Cambridge University Press, Collins, Macmillan, Oxford University Press
  - Le Robert (France), Cornelsen (Germany), Shogakukan (Japan)
- Lexicography and language research at
  - national language institutes in Bulgaria, the Czech Republic, Estonia, Ireland, the Netherlands, Slovakia and Slovenia
- Research and teaching
  - at many universities in many countries
- Language technology
  - for language modelling, text prediction, planning Google AdWords campaigns
  - by individuals and agencies worldwide
- The annual five-day Lexicom workshop in Lexicography and Lexical Computing is a good place to learn how to use the Sketch Engine in depth.
The Sketch Engine is a product of Lexical Computing (LCL), a small research company founded by Adam Kilgarriff in 2003. It works at the intersection of corpus and computational linguistics, and is committed to an empiricist approach to the study of language in which corpora play a central role: for a very wide range of linguistic questions, if a suitable corpus is available, it will help our understanding. Its strapline is ‘corpora for all’.
To provide corpus services, LCL needs corpora. As of May 2013 we have large corpora for 52 languages (‘large’ meaning over 1 million words; in most cases the corpora are over 100 million words). For the most part these are collected from the web (LCL is a leading player in the ‘web as corpus’ initiative) and have involved collaborations with experts in the languages in question, for example:
- with Paul Thompson, Hilary Nesi and colleagues at the Universities of Warwick, Reading, Birmingham and Coventry over the BASE and BAWE corpora of Academic English
- with Silvia Bernardini and colleagues at SSLMIT, University of Bologna, for their very large (ca. 2 billion words) web corpora of German, Italian, English and French (DeWaC, ItWaC, UKWaC, FrWaC)
- with Simon Krek and colleagues at Ljubljana University, on corpora, lemmatisation, part-of-speech tagging and the Sketch Grammar for Slovene
- with Phuong Le-Hong for lemmatisation, part-of-speech tagging and the Sketch Grammar for Vietnamese
Adam Kilgarriff is the company's founder and owner. He is a corpus and computational linguist. Following a PhD at Sussex University on word meaning, he worked at Longman Dictionaries and Brighton University prior to setting up LCL. He has published widely in the areas of word senses, corpora and lexicography, and has given keynote lectures at a number of conferences. He organised the first SENSEVAL competition for the evaluation of word sense disambiguation systems. He has chaired the Association for Computational Linguistics Special Interest Groups on the Lexicon (2000–2004) and Web as Corpus (2006–2009; founding chair), and was a European Association for Lexicography board member from 2002 to 2006. He is a Visiting Research Fellow at the University of Leeds.
Pavel Rychlý is a computer scientist and computational linguist. His PhD thesis was on optimal designs for corpus query systems. Since then he has been developing first the Manatee system and, since 2003, the Sketch Engine. He is a lecturer and senior researcher at the NLP Centre, Masaryk University in Brno, Czech Republic.