Getting started with Stanford CoreNLP

PROJECT INFO

This is a two-part series. The first part covers the theory, and the second part creates a CoreNLP project.

  1. <You are here>
  2. Getting started with Stanford CoreNLP | A Stanford CoreNLP Tutorial

INTRODUCTION

Stanford CoreNLP is written in Java and is the result of work by the Natural Language Processing Group at Stanford University.

The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is part of the Stanford AI Lab.

What is Stanford CoreNLP?

Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.

What are the features of Stanford CoreNLP?

  • An integrated NLP toolkit with a broad range of grammatical analysis tools
  • A fast, robust annotator for arbitrary texts, widely used in production
  • A modern, regularly updated package, with the overall highest quality text analytics
  • Support for a number of major (human) languages
  • Available APIs for most major modern programming languages
  • Ability to run as a simple web service

What tools are integrated with Stanford CoreNLP?

Stanford CoreNLP integrates many of Stanford’s NLP tools, including:

  1. The part-of-speech (POS) tagger
  2. The named entity recognizer (NER)
  3. The parser
  4. The coreference resolution system
  5. Sentiment analysis
  6. Bootstrapped pattern learning
  7. Open information extraction

Moreover, an annotator pipeline can include additional custom or third-party annotators.

How does Stanford CoreNLP work?

CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Part-of-speech tagging

In corpus linguistics, part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

The tags below are the 36 part-of-speech tags of the Penn Treebank tag set:

  1. CC Coordinating conjunction
  2. CD Cardinal number
  3. DT Determiner
  4. EX Existential there
  5. FW Foreign word
  6. IN Preposition or subordinating conjunction
  7. JJ Adjective
  8. JJR Adjective, comparative
  9. JJS Adjective, superlative
  10. LS List item marker
  11. MD Modal
  12. NN Noun, singular or mass
  13. NNS Noun, plural
  14. NNP Proper noun, singular
  15. NNPS Proper noun, plural
  16. PDT Predeterminer
  17. POS Possessive ending
  18. PRP Personal pronoun
  19. PRP$ Possessive pronoun
  20. RB Adverb
  21. RBR Adverb, comparative
  22. RBS Adverb, superlative
  23. RP Particle
  24. SYM Symbol
  25. TO to
  26. UH Interjection
  27. VB Verb, base form
  28. VBD Verb, past tense
  29. VBG Verb, gerund or present participle
  30. VBN Verb, past participle
  31. VBP Verb, non-3rd person singular present
  32. VBZ Verb, 3rd person singular present
  33. WDT Wh-determiner
  34. WP Wh-pronoun
  35. WP$ Possessive wh-pronoun
  36. WRB Wh-adverb
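
Many of the tags above share a prefix with related tags (NN, NNS, NNP, NNPS are all nouns, for example), so a fine-grained tag can be collapsed into a coarse category with a simple prefix check. The following is a minimal sketch of that idea in plain Java. The class name PosTagSketch, the method coarse, and the hand-tagged sentence are assumptions made for illustration; none of them come from CoreNLP.

```java
class PosTagSketch {
    // Collapse a Penn Treebank tag into a coarse category by its prefix.
    // Tags that match none of the four prefixes are reported as "other".
    static String coarse(String tag) {
        if (tag.startsWith("NN")) return "noun";
        if (tag.startsWith("VB")) return "verb";
        if (tag.startsWith("JJ")) return "adjective";
        if (tag.startsWith("RB")) return "adverb";
        return "other";
    }

    public static void main(String[] args) {
        // Hand-tagged example; a real tagger would produce these tags.
        String[][] tagged = {
            {"The", "DT"}, {"quick", "JJ"}, {"foxes", "NNS"},
            {"jumped", "VBD"}, {"quickly", "RB"}
        };
        for (String[] pair : tagged) {
            System.out.println(pair[0] + "/" + pair[1] + " -> " + coarse(pair[1]));
        }
    }
}
```

Note that the prefix check is deliberately narrow: RP (particle) and PRP (personal pronoun) fall into "other" because they do not start with RB or NN.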

Annotation and Annotator

CoreNLP’s core package includes two classes: Annotation and Annotator.

Annotations are data structures that hold the results of the annotators. Annotations are generally maps.

Annotators are more like functions, but they operate on Annotations rather than Objects.

Annotators can perform tasks such as tokenization, parsing, NER, and POS tagging. Annotators and Annotations are combined in AnnotationPipelines. The StanfordCoreNLP class extends AnnotationPipeline and configures it with the NLP annotators.
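
To make the "Annotation is a map, Annotator is a function" idea concrete, here is a minimal sketch in plain Java. It does not use the CoreNLP library: ToyAnnotation, ToyAnnotator, TextKey, and TokensKey are hypothetical stand-ins invented for this example. The one idea borrowed from CoreNLP is that values are looked up by a key class (CoreNLP uses classes such as TokensAnnotation for this), which lets the map return correctly typed values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AnnotationSketch {
    // A key class carries the type of the value stored under it.
    interface Key<T> {}
    static final class TextKey implements Key<String> {}
    static final class TokensKey implements Key<List<String>> {}

    // Toy annotation: a map from key classes to values (NOT the CoreNLP class).
    static final class ToyAnnotation {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <T> void set(Class<? extends Key<T>> key, T value) {
            values.put(key, value);
        }

        @SuppressWarnings("unchecked")
        <T> T get(Class<? extends Key<T>> key) {
            return (T) values.get(key);
        }
    }

    // Toy annotator: a function that reads from and writes to an annotation.
    interface ToyAnnotator {
        void annotate(ToyAnnotation annotation);
    }

    // Whitespace tokenizer used only to show the read/write pattern.
    static final ToyAnnotator WHITESPACE_TOKENIZER = a ->
        a.set(TokensKey.class, Arrays.asList(a.get(TextKey.class).trim().split("\\s+")));

    static ToyAnnotation run(String text) {
        ToyAnnotation annotation = new ToyAnnotation();
        annotation.set(TextKey.class, text);
        WHITESPACE_TOKENIZER.annotate(annotation);
        return annotation;
    }

    public static void main(String[] args) {
        ToyAnnotation a = run("Stanford CoreNLP is written in Java");
        System.out.println(a.get(TokensKey.class));
    }
}
```

The annotator never returns a value. It only adds a new entry to the annotation it was given, which is why several annotators can share one annotation.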

Stanford CoreNLP annotators:

  1. tokenize
  2. cleanxml
  3. ssplit
  4. pos
  5. lemma
  6. ner
  7. regexner
  8. sentiment
  9. truecase
  10. parse
  11. depparse
  12. dcoref
  13. relation
  14. natlog
  15. quote
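
The annotator names above are the values that go into the "annotators" property when a pipeline is configured, written as a comma-separated list in the order they should run. The sketch below builds such a configuration with java.util.Properties only, so it runs without the CoreNLP jar. The class name AnnotatorConfigSketch and its helper methods are assumptions made for this example. With CoreNLP on the classpath, the resulting Properties object is what would typically be passed to the StanfordCoreNLP constructor.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class AnnotatorConfigSketch {
    // Build a Properties object whose "annotators" entry lists the
    // requested annotators in the order given.
    static Properties buildProps(String... annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(",", annotators));
        return props;
    }

    // Read the annotator names back as a list (tolerates spaces after commas).
    static List<String> annotatorsOf(Properties props) {
        return Arrays.asList(props.getProperty("annotators", "").trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        Properties props = buildProps("tokenize", "ssplit", "pos", "lemma", "ner");
        System.out.println(props.getProperty("annotators"));
        // With the CoreNLP library available, the next step would be:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Order matters because later annotators depend on the output of earlier ones. For example, pos needs the tokens and sentences produced by tokenize and ssplit.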

The most important annotators are described below:

  • tokenize
  • ssplit
  • pos
  • lemma
  • ner
  • parse
  • dcoref
  • sentiment

tokenize

1. DESCRIPTION:
Tokenizes the text. This splits the text into roughly “words”, using rules or methods suitable for the language being processed. Sometimes the tokens split up surface words in ways suitable for further NLP-processing, for example, “isn’t” becomes “is” and “n’t”.

The tokenizer saves the beginning and end character offsets of each token in the input text.
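
The two points above (splitting “isn’t” into “is” and “n’t”, and recording character offsets) can be illustrated with a toy regex tokenizer. This is a rough sketch in plain Java and not the CoreNLP tokenizer, which handles far more cases. The class TokenizerSketch, the Token class, and the regular expression are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // A token together with its begin (inclusive) and end (exclusive) offsets.
    static final class Token {
        final String text;
        final int begin;
        final int end;

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    // Alternatives are tried left to right at each position.
    private static final Pattern TOKEN = Pattern.compile(
        "n['\u2019]t\\b"                 // the clitic n't (straight or curly apostrophe)
        + "|\\w+?(?=n['\u2019]t\\b)"     // the word part in front of n't
        + "|\\w+"                        // any other word
        + "|[^\\w\\s]",                  // a single punctuation character
        Pattern.CASE_INSENSITIVE);

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // start() and end() are offsets into the original input text.
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("It isn't hard.")) {
            System.out.println(t);
        }
    }
}
```

Because every token keeps its offsets, text.substring(token.begin, token.end) always returns the token text, so later processing steps can point back into the original input.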

2. ANNOTATOR CLASS NAME:

  • TokenizerAnnotator

3. GENERATED ANNOTATION:

  • TokensAnnotation (list of tokens);
  • CharacterOffsetBeginAnnotation,
  • CharacterOffsetEndAnnotation,
  • TextAnnotation (for each token)

ssplit

1. DESCRIPTION:

Splits a sequence of tokens into sentences.
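
Note that the input is a sequence of tokens, not raw text, which is why ssplit runs after tokenize. A toy version of the idea, written in plain Java, is sketched below. It simply closes a sentence at each ".", "!" or "?" token, which is an assumption made for this example and much cruder than what the real annotator does. The class name SentenceSplitSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SentenceSplitSketch {
    // Tokens that end a sentence in this toy splitter.
    private static final Set<String> TERMINATORS =
        new HashSet<>(Arrays.asList(".", "!", "?"));

    // Group a flat token list into sentences, each a list of tokens.
    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (TERMINATORS.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Trailing tokens without a terminator still form a sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", "world", ".", "How", "are", "you", "?");
        for (List<String> sentence : split(tokens)) {
            System.out.println(sentence);
        }
    }
}
```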

2. ANNOTATOR CLASS NAME:

  • WordsToSentencesAnnotator

3. GENERATED ANNOTATION:

  • SentencesAnnotation

pos

1. DESCRIPTION:

Labels tokens with their POS tags.

2. ANNOTATOR CLASS NAME:

  • POSTaggerAnnotator

3. GENERATED ANNOTATION:

  • PartOfSpeechAnnotation

lemma

1. DESCRIPTION:

Generates the word lemmas for all tokens in the corpus.

2. ANNOTATOR CLASS NAME:

  • MorphaAnnotator

3. GENERATED ANNOTATION:

  • LemmaAnnotation

ner

1. DESCRIPTION:

Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities.

Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, such as ACE and MUC. Numerical entities are recognized using a rule-based system.

Numerical entities that require normalization, e.g., dates, are normalized to NormalizedNamedEntityTagAnnotation.

2. ANNOTATOR CLASS NAME:

  • NERClassifierCombiner

3. GENERATED ANNOTATION:

  • NamedEntityTagAnnotation
  • NormalizedNamedEntityTagAnnotation

parse

1. DESCRIPTION:

Provides full syntactic analysis, using both the constituent and the dependency representations.

The constituent-based output is saved in TreeAnnotation.

Three dependency-based outputs are generated:

  • basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation;
  • collapsed dependencies, saved in CollapsedDependenciesAnnotation;
  • collapsed dependencies with processed coordinations, saved in CollapsedCCProcessedDependenciesAnnotation.

Most users of the parser will prefer the last representation.

2. ANNOTATOR CLASS NAME:

  • ParserAnnotator

3. GENERATED ANNOTATION:

  • TreeAnnotation,
  • BasicDependenciesAnnotation,
  • CollapsedDependenciesAnnotation,
  • CollapsedCCProcessedDependenciesAnnotation

dcoref

1. DESCRIPTION:

Implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved in CorefChainAnnotation.

2. ANNOTATOR CLASS NAME:

  • DeterministicCorefAnnotator

3. GENERATED ANNOTATION:

  • CorefChainAnnotation

sentiment

1. DESCRIPTION:

Implements Socher et al.’s sentiment model. Attaches a binarized tree of the sentence to the sentence-level CoreMap. The nodes of the tree then contain the annotations from RNNCoreAnnotations indicating the predicted class and scores for that subtree.

2. ANNOTATOR CLASS NAME:

  • SentimentAnnotator

3. GENERATED ANNOTATION:

  • SentimentCoreAnnotations.AnnotatedTree

What is the annotation pipeline in Stanford CoreNLP?

CoreNLP implements an annotation pipeline. An AnnotationPipeline is run on the Annotation. An AnnotationPipeline is essentially a List of Annotators, each of which is run in turn. (And an AnnotationPipeline is itself an Annotator, so you can actually nest AnnotationPipelines inside each other.)

Each Annotator reads the value of one or more keys from the Annotation, does some natural language analysis, and then writes the results back to the Annotation.

Typically, each Annotator stores its analyses under different keys, so the information stored in an Annotation accumulates rather than being overwritten. The figure below gives the overall picture.

[Figure: annotation pipeline overview]

The default annotation pipeline is the StanfordCoreNLP class.
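
The pipeline idea described in this section (a list of annotators run in turn, a pipeline that is itself an annotator, and results that accumulate under different keys) can be sketched in a few lines of plain Java. This is an illustration only and does not use the CoreNLP classes: the Annotator interface, the Pipeline class, the string keys, and the three toy annotators are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PipelineSketch {
    // Toy annotator: reads some keys of the annotation and writes new ones.
    interface Annotator {
        void annotate(Map<String, Object> annotation);
    }

    // A pipeline is a list of annotators and is itself an annotator,
    // so pipelines can be nested inside each other.
    static final class Pipeline implements Annotator {
        private final List<Annotator> annotators = new ArrayList<>();

        Pipeline add(Annotator annotator) {
            annotators.add(annotator);
            return this;
        }

        @Override
        public void annotate(Map<String, Object> annotation) {
            for (Annotator annotator : annotators) {
                annotator.annotate(annotation);
            }
        }
    }

    static Map<String, Object> run(String text) {
        // Reads "text", writes "tokens".
        Annotator tokenize = ann ->
            ann.put("tokens", Arrays.asList(((String) ann.get("text")).trim().split("\\s+")));
        // Reads "tokens", writes "tokenCount".
        Annotator count = ann ->
            ann.put("tokenCount", ((List<?>) ann.get("tokens")).size());
        // Reads "tokens", writes "upper".
        Annotator upper = ann -> {
            List<String> out = new ArrayList<>();
            for (Object token : (List<?>) ann.get("tokens")) {
                out.add(token.toString().toUpperCase());
            }
            ann.put("upper", out);
        };

        Pipeline inner = new Pipeline().add(tokenize).add(count);
        Pipeline outer = new Pipeline().add(inner).add(upper); // nested pipeline

        Map<String, Object> annotation = new LinkedHashMap<>();
        annotation.put("text", text);
        outer.annotate(annotation);
        return annotation;
    }

    public static void main(String[] args) {
        System.out.println(run("CoreNLP runs annotators in turn").keySet());
    }
}
```

After the run, the annotation still holds the original text alongside everything the annotators added, because each one wrote under its own key.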

REFERENCES

  1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216&rep=rep1&type=pdf
  2. http://nlp.stanford.edu:8080/parser/
  3. https://stanfordnlp.github.io/CoreNLP/pipelines.html#annotations-and-annotators
  4. https://stanfordnlp.github.io/CoreNLP/api.html