I have had the opportunity to work with various forms of data, including structured text-based statements and free text. Text-based evidence is a crucial source of information in investigations, especially in legal cases. The use of Natural Language Processing (NLP) techniques can assist in uncovering hidden insights and patterns within text-based evidence. In this blog post, I will explore how NLP can be applied to text-based evidence and provide an example of how it can be used in an unsolved historical case.
The Use of NLP in Text-based Evidence Analysis
NLP is a branch of Artificial Intelligence (AI) that enables machines to understand, interpret, and manipulate human language. NLP techniques can be used to extract information from text-based evidence, such as witness statements and expert reports, and to identify patterns that may not be easily noticeable to a human analyst. Some of the common NLP techniques used in text-based evidence analysis include:
- Named Entity Recognition (NER): NER is the process of identifying and extracting entities such as names, locations, organizations, and dates from text-based evidence. NER can help to identify key individuals, places, and events mentioned in a piece of evidence, which can be useful in building a case.
- Sentiment Analysis: Sentiment analysis is the process of identifying the emotional tone of a piece of text. In the context of text-based evidence, sentiment analysis can be used to determine the attitude of a witness or expert towards a particular event or individual.
- Topic Modeling: Topic modeling is the process of identifying topics that are discussed in a piece of text. This technique can be used to identify key themes in a witness statement or expert report, which can provide insights into the case.
- Text Classification: Text classification is the process of assigning labels to text-based evidence based on its content. This technique can be used to identify the relevance of a particular piece of evidence to a case.
Example of NLP in an Unsolved Historical Case
To illustrate the application of NLP in text-based evidence analysis, let us consider an unsolved historical case – the Jack the Ripper case. Jack the Ripper was a serial killer who terrorized the Whitechapel area of London in 1888. Despite extensive investigations by the police, the identity of Jack the Ripper has never been conclusively determined.
Suppose that we have access to witness statements, police reports, and expert analyses of the case. We can apply NLP techniques to extract insights from this text-based evidence.
Sample Data
Document Type | Document Title | Text |
---|---|---|
Witness Statement | Mary Ann Nichols | “I met Mary Ann Nichols on the corner of Osborn Street at about half-past eleven on the night of August 31st.” |
Police Report | Autopsy Report | “The victim had been slashed multiple times with a sharp object. The wounds were concentrated in the abdomen and genital area.” |
Expert Analysis | Criminal Psychologist Report | “The killer was likely male, with a history of violence towards women. He may have had a personal vendetta against prostitutes.” |
Named Entity Recognition
We can use NER to identify key entities mentioned in the text-based evidence. For example, we can identify the names of witnesses, the locations of the crimes, and the dates of the incidents. This information can be used to create a timeline of events and to identify potential suspects.
Python Code:
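import spacy

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
text = "I met Mary Ann Nichols on the corner of Osborn Street at about half-past eleven on the night of August 31st."
doc = nlp(text)
# Print each detected entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)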
Output (the exact spans and labels may vary with the model version):
Mary Ann Nichols PERSON
Osborn Street GPE
about half-past eleven TIME
the night of August 31st DATE
In this example, NER has identified the person the witness met (Mary Ann Nichols), the location (Osborn Street), the time (about half-past eleven), and the date (the night of August 31st). This information can be used to create a timeline of events and to identify potential suspects who were in the area at that time.
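The timeline idea can be sketched in plain Python. The snippet below is only an illustration and is not part of spaCy: it assumes the entities have already been extracted as (text, label) pairs, as in the loop above, and the `group_entities` helper and the statement ID are hypothetical names introduced here.

```python
from collections import defaultdict


def group_entities(statement_id, entities):
    """Group (text, label) pairs from one statement by entity label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return {"statement": statement_id, **grouped}


# Pairs copied from the example output above; in practice they would come
# from [(ent.text, ent.label_) for ent in doc.ents].
entities = [
    ("Mary Ann Nichols", "PERSON"),
    ("Osborn Street", "GPE"),
    ("about half-past eleven", "TIME"),
    ("the night of August 31st", "DATE"),
]

record = group_entities("witness-1", entities)  # "witness-1" is a made-up ID
print(record["PERSON"], record["DATE"], record["TIME"])
```

Records like this, one per statement, can then be sorted or filtered by date and location to assemble a timeline.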
Sentiment Analysis
We can use sentiment analysis to determine the emotional tone of a piece of text. In the context of text-based evidence, we can use sentiment analysis to identify the attitude of a witness or expert towards a particular event or individual. This can provide valuable insights into the motivations of the individuals involved.
Python Code:
from textblob import TextBlob
text = "The killer was a really bad monster who showed no mercy towards his victims."
blob = TextBlob(text)
sentiment = blob.sentiment.polarity
print(sentiment)
# Output: -0.70
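In this example, sentiment analysis has identified a negative sentiment towards the killer. Note that the score describes the tone of the speaker's statement, not the killer himself. It is therefore most useful for flagging emotionally charged or potentially biased statements, which can then be weighed accordingly when building a profile of the killer and assessing potential suspects.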
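A raw polarity score is easier to work with once it is bucketed. The following sketch assumes a score in the range -1.0 to 1.0, which is what TextBlob's `sentiment.polarity` returns. The `label_polarity` helper, the 0.1 threshold, and the sample scores are illustrative assumptions, not TextBlob features.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity <= -threshold:
        return "negative"
    if polarity >= threshold:
        return "positive"
    return "neutral"


# Hypothetical scores for three statements.
scores = {"statement_a": -0.7, "statement_b": 0.05, "statement_c": 0.4}
labels = {name: label_polarity(score) for name, score in scores.items()}
print(labels)
```

The threshold is a judgment call: a wider neutral band flags fewer statements, and a narrower one flags more.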
Topic Modeling
We can use topic modeling to identify key themes in a piece of text. In the context of text-based evidence, we can use topic modeling to identify the main topics discussed in a witness statement or expert report. This can provide valuable insights into the case and help to identify potential suspects.
Python Code:
import gensim
from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

text = "The killer was likely male, with a history of violence towards women. He may have had a personal vendetta against prostitutes."

def tokenize(text):
    # Lowercase, split into word tokens, and drop common stopwords
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

def create_corpus(documents):
    # Build the vocabulary and convert each document to bag-of-words counts
    texts = [tokenize(document) for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return corpus, dictionary

corpus, dictionary = create_corpus([text])
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=1, id2word=dictionary)
print(lda_model.print_topics())
# Output (word order may vary; the nine remaining words each occur once, so each weight is about 1/9):
# [(0, '0.111*"killer" + 0.111*"male" + 0.111*"history" + 0.111*"violence" + 0.111*"women" + 0.111*"personal" + 0.111*"vendetta" + 0.111*"prostitutes" + 0.111*"likely"')]
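In this example, the single topic lists the content words of the statement with equal weights, because one short document gives the model nothing to separate. The words still summarize the theme: a killer who was likely male, with a history of violence towards women and a possible vendetta against prostitutes. Applied to a larger collection of statements and reports, topic modeling can surface recurring themes that help to identify potential suspects who fit this profile.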
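Before LDA runs, each document is reduced to word counts. The sketch below mimics that bag-of-words step using only the standard library, as a rough illustration of what `corpora.Dictionary` and `doc2bow` produce. The tiny stopword set and the `bag_of_words` helper are assumptions made for this example, and gensim's own tokenizer and stopword list differ in detail.

```python
import re
from collections import Counter

# Deliberately small stopword set, for illustration only.
STOPWORDS_SKETCH = {"the", "was", "with", "a", "of", "towards", "he", "may",
                    "have", "had", "against"}


def bag_of_words(text):
    """Lowercase, split into alphabetic tokens, drop stopwords, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS_SKETCH)


text = ("The killer was likely male, with a history of violence towards women. "
        "He may have had a personal vendetta against prostitutes.")
counts = bag_of_words(text)
print(len(counts), sum(counts.values()))
```

Every remaining word occurs exactly once here, so the counts carry little signal on their own. The technique becomes informative when many documents are counted together.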
Text Classification
We can use text classification to assign labels to text-based evidence based on its content. In the context of text-based evidence analysis, we can use text classification to identify the relevance of a particular piece of evidence to the case. This can help to prioritize evidence and focus investigations on key areas.
Python Code:
``` python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training statements
X = [
    "The killer was likely male, with a history of violence towards women. He may have had a personal vendetta against prostitutes.",
    "The victim was found on Buck's Row in the early hours of August 31st.",
    "I saw a man with a bloody knife running away from the crime scene.",
]
# One label per statement (assumed meaning: 1 = relevant, 0 = not relevant)
y = [1, 0, 1]

# Turn each statement into a vector of word counts
vectorizer = CountVectorizer()
X_vect = vectorizer.fit_transform(X)

# Train a multinomial Naive Bayes classifier on the counts
clf = MultinomialNB()
clf.fit(X_vect, y)

# Classify a new statement using the vocabulary learned above
new_text = "I heard a scream coming from Buck's Row in the early hours of August 31st."
new_text_vect = vectorizer.transform([new_text])
label = clf.predict(new_text_vect)
print(label)
```
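In this example, text classification assigns a relevance label to a new piece of evidence (a witness statement). With the three training examples above, the classifier prints [0], because the new statement shares most of its vocabulary ("Buck's Row", "early hours", "August 31st") with the one example labeled 0. An investigator would probably consider this statement relevant, which shows how strongly the result depends on the training data: three labeled examples are far too few, and a useful relevance classifier needs a much larger, carefully labeled set.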
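To see what drives the prediction, the classifier's scoring can be reproduced by hand. The sketch below is a simplified re-implementation of multinomial Naive Bayes with add-one smoothing, written with the standard library only. The `tokenize`, `train`, and `log_scores` helpers are hypothetical names, and the tokenizer only approximates `CountVectorizer`'s default behavior (lowercased words of two or more characters).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Approximates CountVectorizer's default: lowercase, words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def train(texts, labels):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = set().union(*word_counts.values())
    return word_counts, class_counts, vocab


def log_scores(text, word_counts, class_counts, vocab, alpha=1.0):
    """Log prior plus smoothed log likelihood of each in-vocabulary word."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((counts[token] + alpha) / denom)
        scores[label] = score
    return scores


texts = [
    "The killer was likely male, with a history of violence towards women. "
    "He may have had a personal vendetta against prostitutes.",
    "The victim was found on Buck's Row in the early hours of August 31st.",
    "I saw a man with a bloody knife running away from the crime scene.",
]
labels = [1, 0, 1]
model = train(texts, labels)

new_text = "I heard a scream coming from Buck's Row in the early hours of August 31st."
scores = log_scores(new_text, *model)
print(scores, "predicted label:", max(scores, key=scores.get))
```

Printing the per-class scores makes it easy to check which shared words tip the decision, which is a useful sanity check before trusting a label on real evidence.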
Conclusion
NLP techniques can be applied to text-based evidence, such as witness statements, police reports, and expert reports, to surface insights and patterns that are hard to spot by hand. NER identifies key entities, sentiment analysis gauges emotional tone, topic modeling surfaces key themes, and text classification assigns labels based on content. In investigations such as the Jack the Ripper case, these techniques can help to organize the evidence, identify potential suspects, and build a case.
Note: The sample data and code used in this blog post are for illustrative purposes only and do not represent actual evidence or investigation techniques used in the Jack the Ripper case.