Getting started with Stanford CoreNLP
PROJECT INFO
This is a two-part series. The first part covers the theory; in the second part we will create a CoreNLP project.
INTRODUCTION
Stanford CoreNLP is written in Java and is the result of research by the Natural Language Processing Group at Stanford University.
The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is part of the Stanford AI Lab.
What is Stanford CoreNLP?
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.
What are the features of Stanford CoreNLP?
- An integrated NLP toolkit with a broad range of grammatical analysis tools
- A fast, robust annotator for arbitrary texts, widely used in production
- A modern, regularly updated package, with the overall highest quality text analytics
- Support for a number of major (human) languages
- Available APIs for most major modern programming languages
- Ability to run as a simple web service
What tools are integrated with Stanford CoreNLP?
Stanford CoreNLP integrates many of Stanford’s NLP tools, including:
- The part-of-speech (POS) tagger,
- The named entity recognizer (NER),
- The parser,
- The coreference resolution system,
- Sentiment analysis,
- Bootstrapped pattern learning,
- Open information extraction.
Moreover, an annotator pipeline can include additional custom or third-party annotators.
How does Stanford CoreNLP work?
CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
Part-of-speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.
The tags below are those of the Penn Treebank tag set, which CoreNLP's English tagger uses:
- CC Coordinating conjunction
- CD Cardinal number
- DT Determiner
- EX Existential there
- FW Foreign word
- IN Preposition or subordinating conjunction
- JJ Adjective
- JJR Adjective, comparative
- JJS Adjective, superlative
- LS List item marker
- MD Modal
- NN Noun, singular or mass
- NNS Noun, plural
- NNP Proper noun, singular
- NNPS Proper noun, plural
- PDT Predeterminer
- POS Possessive ending
- PRP Personal pronoun
- PRP$ Possessive pronoun
- RB Adverb
- RBR Adverb, comparative
- RBS Adverb, superlative
- RP Particle
- SYM Symbol
- TO to
- UH Interjection
- VB Verb, base form
- VBD Verb, past tense
- VBG Verb, gerund or present participle
- VBN Verb, past participle
- VBP Verb, non-3rd person singular present
- VBZ Verb, 3rd person singular present
- WDT Wh-determiner
- WP Wh-pronoun
- WP$ Possessive wh-pronoun
- WRB Wh-adverb
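Because related tags share a prefix (NN*, VB*, JJ*, RB*), downstream code often folds them into coarse word classes. The following is a minimal sketch of that idea in plain Java; the class `PosTagGroups` and its methods are hypothetical helpers for illustration and are not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of CoreNLP: groups Penn Treebank tags
// into coarse word classes by their prefix.
class PosTagGroups {

    // Returns a coarse class for a tag from the list above.
    static String coarseClass(String tag) {
        if (tag.startsWith("NN")) return "noun";       // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return "verb";       // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return "adjective";  // JJ, JJR, JJS
        if (tag.startsWith("RB")) return "adverb";     // RB, RBR, RBS
        return "other";
    }

    // Keeps only the words whose tag falls in the requested coarse class.
    static List<String> wordsOfClass(String[] words, String[] tags, String wanted) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (coarseClass(tags[i]).equals(wanted)) {
                result.add(words[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-tagged example sentence; a real tagger would supply the tags.
        String[] words = {"The", "quick", "dogs", "ran", "quickly"};
        String[] tags  = {"DT", "JJ", "NNS", "VBD", "RB"};
        System.out.println(wordsOfClass(words, tags, "noun"));
    }
}
```

Prefix matching is a convenience that happens to work for this tag set; tags such as WRB or MD fall through to "other" in this sketch.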
Annotation and Annotator
CoreNLP’s core package includes two classes: Annotation and Annotator.
Annotations are data structures that hold the results of the annotators. Annotations are generally maps.
Annotators are more like functions, but they operate on Annotations rather than Objects.
Annotators perform tasks such as tokenization, parsing, named entity recognition, and POS tagging. Annotators and Annotations are combined in AnnotationPipelines. The StanfordCoreNLP class extends AnnotationPipeline and customizes it with NLP annotators.
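To make the relationship concrete, here is a toy model of the two concepts in plain Java. It is only a sketch of the idea (a map-like result holder and a function-like processor); `ToyAnnotation`, `ToyAnnotator`, and `LengthAnnotator` are invented names and do not mirror the real CoreNLP class signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two concepts; these are NOT the CoreNLP classes.
class ToyAnnotationDemo {

    // An annotation is essentially a map from keys to analysis results.
    static class ToyAnnotation {
        private final Map<String, Object> values = new HashMap<>();

        ToyAnnotation(String text) {
            values.put("text", text);
        }

        Object get(String key) {
            return values.get(key);
        }

        void set(String key, Object value) {
            values.put(key, value);
        }
    }

    // An annotator behaves like a function that reads from and writes to an annotation.
    interface ToyAnnotator {
        void annotate(ToyAnnotation annotation);
    }

    // Hypothetical annotator: reads the "text" entry and adds a "length" entry.
    static class LengthAnnotator implements ToyAnnotator {
        @Override
        public void annotate(ToyAnnotation annotation) {
            String text = (String) annotation.get("text");
            annotation.set("length", text.length());
        }
    }

    public static void main(String[] args) {
        ToyAnnotation annotation = new ToyAnnotation("Stanford CoreNLP");
        new LengthAnnotator().annotate(annotation);
        System.out.println(annotation.get("length"));
    }
}
```

The annotator adds a new entry and leaves the original text in place, which is the behaviour the pipeline relies on.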
Stanford CoreNLP ANNOTATORS:
- tokenize,
- cleanxml,
- ssplit,
- pos,
- lemma,
- ner,
- regexner,
- sentiment,
- truecase,
- parse,
- depparse,
- dcoref,
- relation,
- natlog,
- quote.
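Annotators are selected by name through the `annotators` property, given as a comma-separated list. The sketch below only builds that `java.util.Properties` object using the standard library; `AnnotatorConfig` and `buildProperties` are hypothetical helper names, and the commented-out line indicates where the properties would be handed to CoreNLP in a project that has the library on its classpath.

```java
import java.util.Properties;

// Hypothetical helper that assembles the "annotators" property.
class AnnotatorConfig {

    // Joins the annotator names in the given order into one property value.
    static Properties buildProperties(String... annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(", ", annotators));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties("tokenize", "ssplit", "pos", "lemma", "ner");
        System.out.println(props.getProperty("annotators"));
        // With CoreNLP on the classpath, the properties would configure the pipeline:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Order matters in that list: an annotator generally has to come after the annotators whose output it reads, for example `pos` after `tokenize` and `ssplit`.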
The most important annotators are described below:
- tokenize
- ssplit
- pos
- lemma
- ner
- parse
- dcoref
- sentiment
tokenize
1. DESCRIPTION:
Tokenizes the text. This splits the text into roughly “words”, using rules or methods suitable for the language being processed. Sometimes the tokens split up surface words in ways suitable for further NLP-processing, for example, “isn’t” becomes “is” and “n’t”.
The tokenizer saves the beginning and end character offsets of each token in the input text.
2. ANNOTATOR CLASS NAME:
- TokenizerAnnotator
3. GENERATED ANNOTATION:
- TokensAnnotation (list of tokens);
- CharacterOffsetBeginAnnotation,
- CharacterOffsetEndAnnotation,
- TextAnnotation (for each token)
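The two behaviours described above (splitting contractions and recording character offsets) can be illustrated with a small toy tokenizer. This is a simplified sketch in plain Java, not the CoreNLP tokenizer: the class `ToyTokenizer`, its regular expression, and its contraction rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer for illustration; CoreNLP's tokenizer is far more thorough.
class ToyTokenizer {

    static class Token {
        final String text;
        final int begin; // offset of the first character in the input
        final int end;   // offset just past the last character

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // A word with an optional apostrophe suffix, or any single non-space character.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9]+(?:['\u2019][A-Za-z]+)?|\\S");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String word = m.group();
            String lower = word.toLowerCase();
            boolean negation = word.length() > 3
                && (lower.endsWith("n't") || lower.endsWith("n\u2019t"));
            if (negation) {
                // Split "isn't" into "is" and "n't", keeping exact offsets for both parts.
                int split = m.end() - 3;
                tokens.add(new Token(input.substring(m.start(), split), m.start(), split));
                tokens.add(new Token(input.substring(split, m.end()), split, m.end()));
            } else {
                tokens.add(new Token(word, m.start(), m.end()));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("She isn't here.")) {
            System.out.println(t.text + " [" + t.begin + ", " + t.end + ")");
        }
    }
}
```

Since every token carries its begin and end offsets, later stages can map an analysis result back to the exact span of the original text.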
ssplit
1. DESCRIPTION:
Splits a sequence of tokens into sentences.
2. ANNOTATOR CLASS NAME:
- WordsToSentencesAnnotator
3. GENERATED ANNOTATION:
- SentencesAnnotation
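As a rough picture of what sentence splitting over a token sequence means, the sketch below groups tokens into sentences whenever it sees sentence-final punctuation. It is a deliberately naive, standard-library-only illustration; `ToySentenceSplitter` is a hypothetical name, and the real annotator applies more careful rules.

```java
import java.util.ArrayList;
import java.util.List;

// Naive illustration of sentence splitting over tokens; not the CoreNLP logic.
class ToySentenceSplitter {

    // Groups a token sequence into sentences, closing a sentence after ".", "!" or "?".
    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without end punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of("Hi", ".", "How", "are", "you", "?")));
    }
}
```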
pos
1. DESCRIPTION:
Labels tokens with their POS tag.
2. ANNOTATOR CLASS NAME:
- POSTaggerAnnotator
3. GENERATED ANNOTATION:
- PartOfSpeechAnnotation
lemma
1. DESCRIPTION:
Generates the word lemmas for all tokens in the corpus.
2. ANNOTATOR CLASS NAME:
- MorphaAnnotator
3. GENERATED ANNOTATION:
- LemmaAnnotation
ner
1. DESCRIPTION:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities.
Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, such as ACE and MUC. Numerical entities are recognized using a rule-based system.
Numerical entities that require normalization, e.g., dates, are normalized to NormalizedNamedEntityTagAnnotation.
2. ANNOTATOR CLASS NAME:
- NERClassifierCombiner
3. GENERATED ANNOTATION:
- NamedEntityTagAnnotation
- NormalizedNamedEntityTagAnnotation
parse
1. DESCRIPTION:
Provides full syntactic analysis, using both the constituent and the dependency representations.
The constituent-based output is saved in TreeAnnotation.
The parser generates three dependency-based outputs: basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation;
collapsed dependencies, saved in CollapsedDependenciesAnnotation; and collapsed dependencies with processed coordinations, saved in CollapsedCCProcessedDependenciesAnnotation.
Most users of the parser will prefer the last of these representations.
2. ANNOTATOR CLASS NAME:
- ParserAnnotator
3. GENERATED ANNOTATION:
- TreeAnnotation,
- BasicDependenciesAnnotation,
- CollapsedDependenciesAnnotation,
- CollapsedCCProcessedDependenciesAnnotation
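The difference between basic and collapsed dependencies is easiest to see on a preposition: in the basic form the preposition is its own node, for example prep(based, in) and pobj(in, LA), while the collapsed form folds it into the relation name, prep_in(based, LA). The following toy sketch in plain Java handles only that one pattern; `ToyDependencyCollapse` and `Dep` are invented names, words stand in for graph nodes (a real graph would identify tokens by position), and CoreNLP's own collapsing covers many more cases.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; CoreNLP's own collapsing logic handles many more cases.
class ToyDependencyCollapse {

    // One typed dependency, written relation(governor, dependent).
    static class Dep {
        final String relation;
        final String governor;
        final String dependent;

        Dep(String relation, String governor, String dependent) {
            this.relation = relation;
            this.governor = governor;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return relation + "(" + governor + ", " + dependent + ")";
        }
    }

    // Folds each prep(X, P) + pobj(P, Y) pair into a single prep_P(X, Y) edge.
    static List<Dep> collapsePrepositions(List<Dep> basic) {
        boolean[] consumed = new boolean[basic.size()];
        List<Dep> collapsed = new ArrayList<>();
        for (int i = 0; i < basic.size(); i++) {
            Dep prep = basic.get(i);
            if (!prep.relation.equals("prep")) continue;
            for (int j = 0; j < basic.size(); j++) {
                Dep pobj = basic.get(j);
                if (pobj.relation.equals("pobj") && pobj.governor.equals(prep.dependent)) {
                    collapsed.add(new Dep("prep_" + prep.dependent, prep.governor, pobj.dependent));
                    consumed[i] = true;
                    consumed[j] = true;
                }
            }
        }
        List<Dep> result = new ArrayList<>();
        for (int i = 0; i < basic.size(); i++) {
            if (!consumed[i]) result.add(basic.get(i)); // untouched edges keep their order
        }
        result.addAll(collapsed); // collapsed edges are appended at the end
        return result;
    }

    public static void main(String[] args) {
        List<Dep> basic = List.of(
            new Dep("nsubj", "makes", "Bell"),
            new Dep("prep", "based", "in"),
            new Dep("pobj", "in", "LA"));
        System.out.println(collapsePrepositions(basic));
    }
}
```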
dcoref
1. DESCRIPTION:
Implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved in CorefChainAnnotation.
2. ANNOTATOR CLASS NAME:
- DeterministicCorefAnnotator
3. GENERATED ANNOTATION:
- CorefChainAnnotation
sentiment
1. DESCRIPTION:
Implements Socher et al.’s sentiment model. Attaches a binarized tree of the sentence to the sentence-level CoreMap. The nodes of the tree then contain the annotations from RNNCoreAnnotations indicating the predicted class and scores for that subtree.
2. ANNOTATOR CLASS NAME:
- SentimentAnnotator
3. GENERATED ANNOTATION:
- SentimentCoreAnnotations.AnnotatedTree
What is the annotation pipeline in Stanford CoreNLP?
CoreNLP implements an annotation pipeline. An AnnotationPipeline is run on the Annotation. An AnnotationPipeline is essentially a List of Annotators, each of which is run in turn. (And an AnnotationPipeline is itself an Annotator, so you can actually nest AnnotationPipelines inside each other.)
Each Annotator reads the value of one or more keys from the Annotation, does some natural language analysis, and then writes the results back to the Annotation.
Typically, each Annotator stores its analyses under different keys, so that the information stored in an Annotation is cumulative rather than things being overwritten. The figure shows the overall picture.
The default annotation pipeline is the StanfordCoreNLP class.
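The pipeline idea can be sketched in a few lines of plain Java: a pipeline holds a list of annotators, runs them in order on one shared annotation, and is itself an annotator so that pipelines can be nested. Everything here (`ToyPipelineDemo`, `Annotator`, `Pipeline`, and the key names) is a hypothetical stand-in that uses a plain map as the annotation; it does not reproduce the real AnnotationPipeline API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy pipeline for illustration; the names do not match the CoreNLP classes.
class ToyPipelineDemo {

    interface Annotator {
        void annotate(Map<String, Object> annotation);
    }

    // A pipeline is a list of annotators run in order, and is itself an annotator.
    static class Pipeline implements Annotator {
        private final List<Annotator> annotators = new ArrayList<>();

        Pipeline(Annotator... steps) {
            annotators.addAll(Arrays.asList(steps));
        }

        @Override
        public void annotate(Map<String, Object> annotation) {
            for (Annotator annotator : annotators) {
                annotator.annotate(annotation);
            }
        }
    }

    static Map<String, Object> run(String text) {
        // Each step reads existing keys and writes its result under a new key.
        Annotator tokenize = ann ->
            ann.put("tokens", Arrays.asList(((String) ann.get("text")).trim().split("\\s+")));
        Annotator count = ann ->
            ann.put("tokenCount", ((List<?>) ann.get("tokens")).size());
        Annotator upper = ann -> {
            List<String> up = new ArrayList<>();
            for (Object token : (List<?>) ann.get("tokens")) {
                up.add(token.toString().toUpperCase());
            }
            ann.put("upperTokens", up);
        };

        Pipeline inner = new Pipeline(tokenize, count); // a nested pipeline
        Pipeline outer = new Pipeline(inner, upper);    // the inner pipeline used as one step

        Map<String, Object> annotation = new LinkedHashMap<>();
        annotation.put("text", text);
        outer.annotate(annotation);
        return annotation;
    }

    public static void main(String[] args) {
        System.out.println(run("CoreNLP runs annotators in turn").keySet());
    }
}
```

After the run, the map still holds the original text alongside every intermediate result, which is the cumulative behaviour described above.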
REFERENCES