Join Udacity’s Natural Language Processing (NLP) Nanodegree and become an expert NLP engineer: Enroll in the NLP Nanodegree today!
Natural Language Processing (NLP) is an area of computer science and artificial intelligence concerned with interactions between computer and human (natural) language.
Wondering what NLTK is? According to Wikipedia, “The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.”
Prerequisites for this course:
Let’s Explore the NLTK Package!
1. Import nltk
Import nltk in order to use its functions.
import nltk
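Most of the steps below also rely on NLTK data packages (tokenizer models, the stop word list, WordNet, the tagger and the chunker), which are downloaded separately from the library itself. Here is a minimal sketch of a one-time setup. The resource names are an assumption based on older NLTK releases; newer releases may ask for differently named packages, and the LookupError that NLTK raises names the exact resource that is missing. `download_nltk_data` is a hypothetical helper name, not part of NLTK.

```python
# Resource names are an assumption: they match older NLTK releases and may
# differ in newer ones (the LookupError message names the missing package).
REQUIRED_RESOURCES = [
    "punkt",                       # word and sentence tokenizer models
    "stopwords",                   # stop word lists
    "wordnet",                     # dictionary used by WordNetLemmatizer
    "averaged_perceptron_tagger",  # POS tagger model
    "maxent_ne_chunker",           # named entity chunker model
    "words",                       # word list used by the chunker
]


def download_nltk_data(resources=REQUIRED_RESOURCES):
    """Download each resource once (hypothetical helper, needs network access)."""
    import nltk  # imported lazily so defining the helper has no side effects

    for name in resources:
        nltk.download(name, quiet=True)
```

Calling `download_nltk_data()` once per machine should be enough, since NLTK caches the downloaded data locally.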
2. Convert text to lower case:
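Converting the text to lower case is a common normalization step, because string comparison is case sensitive: without it, "NLTK" and "nltk" would be treated as two different words.
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
lower_text = text.lower()
print(lower_text)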
[OUTPUT]: this is a demo text for nlp using nltk. full form of nltk is natural language toolkit
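To see why case matters, here is a small sketch that compares the vocabulary size before and after lowercasing. The token list is made up for illustration.

```python
# Made-up tokens that differ only in capitalization.
tokens = ["NLTK", "nltk", "Nltk", "Toolkit", "toolkit"]

# Without normalization every spelling counts as a different word.
raw_vocabulary = set(tokens)

# After lowercasing, spellings that differ only in case collapse together.
lowered_vocabulary = {token.lower() for token in tokens}

print(len(raw_vocabulary))         # 5
print(sorted(lowered_vocabulary))  # ['nltk', 'toolkit']
```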
3. Tokenize Word
Tokenize the text to get its tokens, i.e. break the sentences into individual words and punctuation marks.
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
print(word_tokens)
[OUTPUT]: ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']
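To get a feel for what a word tokenizer does, here is a rough approximation using only a regular expression from the standard library. This is not how nltk.word_tokenize works internally. It is a simplified sketch that happens to give the same tokens for this sentence, but it would split contractions such as "don't" differently.

```python
import re

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
simple_tokens = re.findall(r"\w+|[^\w\s]", text)
print(simple_tokens)
```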
4. Tokenize Sentence
If the text contains more than one sentence, tokenize it into sentences, i.e. break the text into a list of sentences.
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
sent_token = nltk.sent_tokenize(text)
print(sent_token)
[OUTPUT]: ['This is a Demo Text for NLP using NLTK.', 'Full form of NLTK is Natural Language Toolkit']
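For comparison, here is a naive sentence splitter that breaks after every ".", "!" or "?" followed by whitespace. It is a simplified sketch, not NLTK's method. It works for the demo text but trips over abbreviations, which is the kind of case a trained sentence tokenizer is designed to handle.

```python
import re

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Split wherever sentence-ending punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)

# The same rule wrongly splits after an abbreviation such as "Dr.".
tricky = re.split(r"(?<=[.!?])\s+", "Dr. Smith arrived. He sat down.")
print(tricky)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```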
5. Stop words removal
Remove stop words such as is, the and a using NLTK’s stop word list, as they carry little information on their own. The list is lowercase, so the capitalized This survives in the output below unless the tokens are lowercased first.
import nltk
from nltk.corpus import stopwords
stopword = stopwords.words('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
removing_stopwords = [word for word in word_tokens if word not in stopword]
print(removing_stopwords)
[OUTPUT]: ['This', 'Demo', 'Text', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'NLTK', 'Natural', 'Language', 'Toolkit']
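The output above still contains "This" and "." because the comparison is case sensitive and punctuation is not a stop word. Here is a sketch of a case-insensitive variant that also drops punctuation. The small `stop_set` is a hypothetical stand-in so the example runs without NLTK data; with NLTK available, set(stopwords.words('english')) could be used instead.

```python
# Tokens copied from the word tokenization output earlier in the article.
word_tokens = ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.',
               'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']

# Hypothetical mini stop list (an assumption, not NLTK's full list).
stop_set = {"this", "is", "a", "for", "of", "the"}

filtered = [
    word for word in word_tokens
    if word.lower() not in stop_set  # compare in lower case
    and word.isalpha()               # also drop punctuation tokens such as '.'
]
print(filtered)
```

Using a set rather than a list for the stop words keeps each membership check fast when the text is long.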
6. Lemmatize
Lemmatize the text to reduce each word to its dictionary form (lemma), e.g. dogs becomes dog. WordNetLemmatizer treats every word as a noun unless a part of speech is passed, which is why are and barking are left unchanged in the output below.
import nltk
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer looks words up in WordNet and treats each word as a noun by default
wordnet_lemmatizer = WordNetLemmatizer()
text = "the dogs are barking outside. Are the cats in the garden?"
word_tokens = nltk.word_tokenize(text)
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
print(lemmatized_word)
[OUTPUT]: ['the', 'dog', 'are', 'barking', 'outside', '.', 'Are', 'the', 'cat', 'in', 'the', 'garden', '?']
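Verbs such as "are" and "barking" stay unchanged above because the lemmatizer assumes every word is a noun. One common workaround is to pass the part of speech from the POS tagger to the lemmatizer. Here is a minimal sketch, assuming Penn Treebank tags as input and the single-letter WordNet codes ('n', 'v', 'a', 'r') that lemmatize() accepts through its pos argument. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default


def lemmatize_tagged(tagged_tokens):
    """Lemmatize (word, tag) pairs; requires NLTK and its WordNet data."""
    from nltk.stem import WordNetLemmatizer  # imported lazily

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]
```

The input for `lemmatize_tagged` would be the list of (word, tag) pairs returned by nltk.pos_tag.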
7. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
import nltk
from nltk.stem import SnowballStemmer
# The English Snowball stemmer is an improved version of the Porter stemming algorithm
snowball_stemmer = SnowballStemmer('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
stemmed_word = [snowball_stemmer.stem(word) for word in word_tokens]
print(stemmed_word)
[OUTPUT]: ['this', 'is', 'a', 'demo', 'text', 'for', 'nlp', 'use', 'nltk', '.', 'full', 'form', 'of', 'nltk', 'is', 'natur', 'languag', 'toolkit']
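Unlike lemmatization, stemming uses no dictionary: it strips suffixes by rule, which is why results such as "natur" and "languag" are not real words. The toy stemmer below illustrates the idea with three suffix rules. It is a deliberately crude sketch, not the Snowball algorithm, and `toy_stem` is a hypothetical name.

```python
SUFFIXES = ("ing", "ed", "s")  # tried in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print(toy_stem("barking"))  # bark
print(toy_stem("dogs"))     # dog
print(toy_stem("this"))     # thi  (over-stemming: the rule is too blunt)
```

The last line shows why real stemmers need many more rules and exceptions than a plain suffix list.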
8. Get word frequency
Count word occurrences using NLTK’s FreqDist class.
import nltk
from nltk import FreqDist
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word = nltk.word_tokenize(text.lower())
freq = FreqDist(word)
print(freq.most_common(5))
[OUTPUT]: [('is', 2), ('nltk', 2), ('this', 1), ('a', 1), ('demo', 1)]
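FreqDist is built on Python's collections.Counter, so the same counting can be sketched with the standard library alone. The example below uses a simple regular expression instead of nltk.word_tokenize (an assumption made so that it runs without NLTK data), which also leaves the "." token out of the counts, and it adds relative frequencies.

```python
import re
from collections import Counter

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Keep only word characters, so punctuation is never counted.
words = re.findall(r"\w+", text.lower())
counts = Counter(words)
total = sum(counts.values())

for token, count in counts.most_common(2):
    # Relative frequency: share of all words taken up by this word.
    print(token, count, round(count / total, 3))
```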
9. POS (Part of Speech) tagging
POS tagging labels each word with its part of speech, such as noun, verb or adjective.
import nltk
text = "the dogs are barking outside."
word = nltk.word_tokenize(text)
pos_tag = nltk.pos_tag(word)
print(pos_tag)
[OUTPUT]: [('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'), ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]
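Once the words are tagged, the tags can be used to pick out particular word classes. Here is a small sketch that filters the tagged output above by tag prefix; in the Penn Treebank tag set, noun tags start with NN and verb tags with VB. `words_with_tag_prefix` is a hypothetical helper name.

```python
# Tagged tokens copied from the output above.
tagged = [('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'),
          ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns)  # ['dogs']
print(verbs)  # ['are', 'barking']
```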
10. Named Entity Recognition
NER (Named Entity Recognition) is the process of locating named entities in text, such as people, organizations and places.
import nltk
text = "who is Barrack Obama"
word = nltk.word_tokenize(text)
pos_tag = nltk.pos_tag(word)
chunk = nltk.ne_chunk(pos_tag)
NE = [" ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE)
[OUTPUT]: ['Barrack Obama']
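The snippet above keeps only the entity text. Each subtree returned by nltk.ne_chunk also carries an entity type, available through its label() method. The sketch below collects both. `StubTree` is a hypothetical stand-in for nltk.Tree so that the example runs without the NLTK models, and the PERSON label in the sample data is an assumption about what the chunker would return for this input.

```python
class StubTree(list):
    """Hypothetical stand-in for nltk.Tree: (word, tag) pairs plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def extract_entities(chunked):
    """Return (entity text, entity type) for every labelled subtree."""
    entities = []
    for node in chunked:
        # Subtrees have a label() method; plain (word, tag) tuples do not.
        if hasattr(node, "label"):
            name = " ".join(word for word, tag in node)
            entities.append((name, node.label()))
    return entities


# Assumed shape of the chunker result for "who is Barrack Obama".
chunked = [("who", "WP"), ("is", "VBZ"),
           StubTree("PERSON", [("Barrack", "NNP"), ("Obama", "NNP")])]
print(extract_entities(chunked))  # [('Barrack Obama', 'PERSON')]
```

With NLTK and its models installed, the result of nltk.ne_chunk(pos_tag) could be passed to `extract_entities` directly.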
PS: Run all of these snippets and ta-da! You now know the basics of NLP.
You can also try some mini projects like:
- Extracting keywords of documents, articles.
- Generating part of speech for phrases.
- Getting the top used words among all documents.
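As a starting point for the last idea, here is a minimal sketch that chains lowercasing, tokenization, stop word removal and counting across several documents. The documents and the stop list are made up for illustration, and a regular expression stands in for nltk.word_tokenize so that the example runs without NLTK data.

```python
import re
from collections import Counter

# Made-up documents and a hypothetical mini stop list.
documents = [
    "NLTK is a toolkit for natural language processing.",
    "The toolkit is written in Python.",
    "Python is a popular language for NLP.",
]
stop_set = {"is", "a", "for", "the", "in"}


def top_words(docs, n=3):
    """Count non-stop words over all documents and return the n most common."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())  # lowercase, letters only
        counts.update(token for token in tokens if token not in stop_set)
    return counts.most_common(n)


print(top_words(documents))
```

Swapping in nltk.word_tokenize and NLTK's stop word list would turn this into a version built on the steps covered above.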
Also, here is a list of Open-Source NLP Projects in Python that you might want to try out.
Author: Pema Gurung
Pema Gurung is an ML/NLP engineer currently working at EKbana.