PHPFixing
Showing posts with label nlp.

Wednesday, November 2, 2022

[FIXED] How to perform same operation on multiple text files and save the output in different files using python?

 November 02, 2022     file, for-loop, nlp, python, text-files     No comments   

Issue

I have written code that extracts stop words from a text file and outputs two new text files: one file contains the stop words from that text file and the other contains the data without stop words. Now I have more than 100 text files in a folder, and I would like to perform the same operation on all of those files at once.

For example, if there is a Folder A which contains 100 text files, the code should be executed on all of those text files. The output should be two new text files, such as 'Stop_Word_Consist_Filename.txt' and 'Stop_word_not_Filename.txt', which should be stored in a separate folder. That means for 100 text files there will be 200 output text files stored in a new folder. Please note that 'Filename' in both of these output files is the actual name of the text file, meaning 'Walmart.txt' should produce 'Stop_Word_Consist_Walmart.txt' and 'Stop_word_not_Walmart.txt'. I did try a few things, and I know a loop over the directory path is involved, but I didn't have any success.

Apologies for such a long question.

Following is the code for 1 file.

import os
import numpy as np
import pandas as pd

# Paths of the source files and of the folder for the modified files
files_path = os.getcwd()
# a separate folder (created below) to store the modified files in
files_after_path = os.getcwd() + '/' + 'Stopwords_folder'
os.makedirs(files_after_path, exist_ok=True)
text_files = os.listdir(files_path)
data = pd.DataFrame(text_files)
data.columns = ["Review_text"]

import re
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def clean_text(df):
    all_reviews = list()
    #lines = df["Review_text"].values.tolist()
    lines = data.values.tolist()

    for text in lines:
        #text = text.lower()
        text = [word.lower() for word in text]

        pattern = re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
        text = pattern.sub('', str(text))
        
        emoji = re.compile("["
                           u"\U0001F600-\U0001FFFF"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
        text = emoji.sub(r'', text)
        
        text = re.sub(r"i'm", "i am", text)
        text = re.sub(r"he's", "he is", text)
        text = re.sub(r"she's", "she is", text)
        text = re.sub(r"that's", "that is", text)        
        text = re.sub(r"what's", "what is", text)
        text = re.sub(r"where's", "where is", text) 
        text = re.sub(r"\'ll", " will", text)  
        text = re.sub(r"\'ve", " have", text)  
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"don't", "do not", text)
        text = re.sub(r"did't", "did not", text)
        text = re.sub(r"can't", "can not", text)
        text = re.sub(r"it's", "it is", text)
        text = re.sub(r"couldn't", "could not", text)
        text = re.sub(r"have't", "have not", text)
        
        text = re.sub(r"[,.\"!@#$%^&*(){}?/;`~:<>+=-]", "", text)
        tokens = word_tokenize(text)
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in tokens]
        words = [word for word in stripped if word.isalpha()]
        stop_words = set(stopwords.words("english"))
        stop_words.discard("not")
        PS = PorterStemmer()
        words = [PS.stem(w) for w in words if not w in stop_words]
        words = ' '.join(words)
        all_reviews.append(words)
    return all_reviews,stop_words

for entry in data:
    #all_reviews , stop_words = clean_text(entry)
    for r in all_reviews: 
        if not r in stop_words: 
            appendFile = open(f'No_Stopwords{entry}.txt','a') 
            appendFile.write(" "+r) 
            appendFile.close() 
    
    for r in stop_words: 
        appendFile = open(f'Stop_Word_Consist{entry}.txt','a') 
        appendFile.write(" "+r) 
        appendFile.close() 
        
    all_reviews , stop_words = clean_text(entry)

UPDATE :

So I have made changes to the code. I did get two output files, Stop_Word_Consist and No_Stop_word, but I am not getting the required data inside: the Stop_Word_Consist file does not contain the stop words I am looking for. I am pretty sure I made some mistakes in the indentation. I would appreciate the help.


Solution

You can use os.listdir to get the names of the text files in the folder, and a for loop to process each one. To name each output file you can use an f-string when opening it, so it looks like f'Stop_Word_Consist_{filename}.txt':

for entry in os.listdir(files_path):           # loop over every file in the source folder
    all_reviews, stop_words = clean_text(entry)
    name = os.path.splitext(entry)[0]          # file name without the .txt extension

    for r in all_reviews:
        if r not in stop_words:
            appendFile = open(f'Stop_word_not_{name}.txt', 'a')
            appendFile.write(" " + r)
            appendFile.close()

    for r in stop_words:
        appendFile = open(f'Stop_Word_Consist_{name}.txt', 'a')
        appendFile.write(" " + r)
        appendFile.close()
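
For the full requirement in the question (every input file producing two output files inside a separate folder), here is a minimal end-to-end sketch rather than a drop-in replacement: it assumes the inputs are plain UTF-8 .txt files and inlines a simplified stop-word split instead of calling clean_text, whose extra cleaning steps (URL removal, contraction expansion, stemming) you can copy back in.

import os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

files_path = os.getcwd()
output_path = os.path.join(files_path, 'Stopwords_folder')
os.makedirs(output_path, exist_ok=True)

stop_words = set(stopwords.words('english'))

for filename in os.listdir(files_path):
    if not filename.endswith('.txt'):
        continue                                         # skip anything that is not a text file
    with open(os.path.join(files_path, filename), encoding='utf-8') as f:
        words = word_tokenize(f.read().lower())

    kept = [w for w in words if w.isalpha() and w not in stop_words]
    removed = [w for w in words if w.isalpha() and w in stop_words]

    name = os.path.splitext(filename)[0]                 # 'Walmart.txt' -> 'Walmart'
    with open(os.path.join(output_path, f'Stop_word_not_{name}.txt'), 'w', encoding='utf-8') as f:
        f.write(' '.join(kept))
    with open(os.path.join(output_path, f'Stop_Word_Consist_{name}.txt'), 'w', encoding='utf-8') as f:
        f.write(' '.join(removed))

The key point is that the per-file logic lives inside the loop, and the output names are built from os.path.splitext(filename)[0], so 'Walmart.txt' yields 'Stop_Word_Consist_Walmart.txt' and 'Stop_word_not_Walmart.txt' in the new folder.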


Answered By - Le_Me
Answer Checked By - Terry (PHPFixing Volunteer)

Friday, October 7, 2022

[FIXED] How to verify if two text datasets are from different distribution?

 October 07, 2022     chi-squared, data-analysis, machine-learning, nlp, statistics     No comments   

Issue

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.

How do I measure whether both datasets come from the same distribution?

The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.

I am planning to use a chi-square test, but I am not sure if it will help for text data, considering the high degrees of freedom.

Update: Example: Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?

Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv

Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv

Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset

Now,

  1. How different are train and eval 1?
  2. How different are train and eval 2?
  3. Is the dissimilarity between train and eval 2 by chance ? What is the statistical significance and p value?

Solution

The question "are text A and text B coming from the same distribution?" is somehow poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (distribution of all questions on StackExchange) or from different distributions (distribution of two different subdomains of StackExchange). So it's not clear what is the property that you want to test.

Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.

As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python:

import requests
from bs4 import BeautifulSoup
urls = [
    'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)', 
    'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
texts = [BeautifulSoup(requests.get(u).text).find('div', {'class': 'mw-parser-output'}).text for u in urls]

Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.

import re
from collections import Counter
from copy import deepcopy
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])  
# [5068, 4053]: texts are of approximately the same size

def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5

print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?

I get a value of 0.001178, but I don't know whether that is a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis: when both texts are from the same distribution. To simulate it, I merge the two texts into one, split the tokens randomly, and calculate my statistic on the two random parts.

import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))

Now I can see how unusual the value of my observed test statistic is under the null hypothesis:

observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value)  # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011  0.0006 0.0004

We see that when the texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
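
If you specifically want the chi-square statistic mentioned in the question, the same permutation framework works; you only swap the statistic. Below is a hedged sketch (the helper name chi2_stat is mine, and it reuses counters, tokens, split and random from the code above). With a vocabulary this large the asymptotic chi-square distribution is unreliable, which is exactly why the simulated null is preferable.

def chi2_stat(c1, c2):
    # two-sample chi-square statistic of homogeneity over the joint vocabulary
    vocab = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = n1 + n2
    stat = 0.0
    for word in vocab:
        pooled = c1[word] + c2[word]                 # combined count of this word
        for observed_count, n in ((c1[word], n1), (c2[word], n2)):
            expected = pooled * n / total            # expected count under homogeneity
            stat += (observed_count - expected) ** 2 / expected
    return stat

observed_chi2 = chi2_stat(*counters)
chi2_distribution = []
for i in range(1000):
    random.shuffle(tokens)
    chi2_distribution.append(chi2_stat(Counter(tokens[:split]), Counter(tokens[split:])))
p_value_chi2 = sum(x >= observed_chi2 for x in chi2_distribution) / len(chi2_distribution)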



Answered By - David Dale
Answer Checked By - Mary Flores (PHPFixing Volunteer)

Saturday, July 9, 2022

[FIXED] How can I use Watson NLP to analyze Keywords with JS?

 July 09, 2022     ibm-watson, javascript, keyword, nlp, p5.js     No comments   

Issue

I am trying to create a keyword analysis using Watson NLP and JS.

I tried the following code, but the result is a ReferenceError, and I have no idea how to make it work:

var keywords=response.result.keywords;
  print(keywords);
  createElement("h3", "Main keywords of this synopsis");
  
  nbkeywords = 3;
  createP("Keywords in this synopsis are:");
  createP(keywords[i].text);
 }

Solution

This is an example of JSON response from the keywords feature of the Watson NLU API (reference):

{
  "usage": {
    "text_units": 1,
    "text_characters": 1536,
    "features": 1
  },
  "keywords": [
    {
      "text": "curated online courses",
      "sentiment": {
        "score": 0.792454
      },
      "relevance": 0.864624,
      "emotions": {
        "sadness": 0.188625,
        "joy": 0.522781,
        "fear": 0.12012,
        "disgust": 0.103212,
        "anger": 0.106669
      }
    },
    {
      "text": "free virtual server",
      "sentiment": {
        "score": 0.664726
      },
      "relevance": 0.864593,
      "emotions": {
        "sadness": 0.265225,
        "joy": 0.532354,
        "fear": 0.07773,
        "disgust": 0.090112,
        "anger": 0.102242
      }
    }
  ],
  "language": "en",
  "retrieved_url": "https://www.ibm.com/us-en/"
}

This means that the "keywords" key in the JSON response is an array containing other JSON objects. To print all the keywords you need to loop over this array, as shown below with a for statement:

var keywords = response.result.keywords;
...
createElement("h3", "Main keywords of this synopsis");
createP("Keywords in this synopsis are:");
var numberOfKeywords = keywords.length;
for (var i = 0; i < numberOfKeywords; i++) {
    createP(keywords[i].text);
}

The official Watson NLU documentation also has JavaScript examples that can help you understand the service API. See https://cloud.ibm.com/apidocs/natural-language-understanding?code=node#keywords.

I hope this answer helps you.



Answered By - Vanderlei Munhoz
Answer Checked By - David Goodson (PHPFixing Volunteer)