PHPFixing
Showing posts with label subset. Show all posts

Tuesday, November 22, 2022

[FIXED] How select area/polygon (lat, long) inside data frame R

 November 22, 2022     dataframe, multiple-conditions, r, select, subset

Issue

I have one data frame with more than 90,000 locations (latitude and longitude) with dive depths, and I would like to do two things, please.

# I wrote this example so that anyone who wants to help can try it in their own R session.

library(data.table)

df = data.table(dive = c(10, 15, 20, 50, 70, 80, 90, 40, 100, 40, 40, 
                         50, 67, 45, 70, 30),
                lat = c(-23, -24, -25, -26, -27, -28, -29, -30, -32, 
                        -33, -34, -35, -36, -37, -38, -39),
                lon = c(-44, -43, -42, -41, -40, -39, -38, -35, -30, 
                        -28, -25, -23, -20, -19, -18, -15))

# all of these columns are numeric

First

So that you can understand better, here is my map; the circles show the dive depth at each location:

[image: map of dive depths plotted as circles at each location]

I have dive depths and I want to select one specific area (latitude from -36 to -27, longitude from -40 to -27).

And I want to select, within the data frame, the dive depths inside the area shown in blue:

[image: map with the selected area highlighted in blue]

Second

Now, I need to select the inverse area, in green:

[image: map with the complementary area highlighted in green]

WHAT I TRIED

I tried to do this to select the "blue" area:

df2 <- df[df$lat <= -36 & df$lat >= -27 & df$lon >= -27 & df$lon <= -40]
# this returned the data frame with no variables and all locations


# And I tried inserting a comma at the end:

df2 <- df[df$lat <= -36 & df$lat >= -27 & df$lon >= -27 & df$lon <= -40, ]
# this returned a data frame with no observations

Does anyone know how to do this? Thanks!


Solution

Note that your original conditions can never be satisfied: df$lat <= -36 & df$lat >= -27 asks for values that are simultaneously at most -36 and at least -27. With the bounds the right way around, you can use:

blue_area <- subset(df, lat >= -36 & lat <= -27 & lon <= -27 & lon >= -40)
green_area <- dplyr::anti_join(df, blue_area)

anti_join() returns the rows of df that have no match in blue_area, i.e. the green area.
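For readers coming from Python, the same box selection and its complement can be sketched with pandas (a translation of the answer above for illustration, not part of the original answer):

```python
import pandas as pd

# The question's data, rebuilt as a pandas DataFrame
df = pd.DataFrame({
    "dive": [10, 15, 20, 50, 70, 80, 90, 40, 100, 40, 40, 50, 67, 45, 70, 30],
    "lat":  [-23, -24, -25, -26, -27, -28, -29, -30, -32, -33, -34, -35, -36, -37, -38, -39],
    "lon":  [-44, -43, -42, -41, -40, -39, -38, -35, -30, -28, -25, -23, -20, -19, -18, -15],
})

# "Blue" box: lat in [-36, -27] and lon in [-40, -27], both bounds inclusive
blue_area = df[df["lat"].between(-36, -27) & df["lon"].between(-40, -27)]

# "Green" complement: every row not selected above (the anti-join)
green_area = df.drop(blue_area.index)
```

With this sample data the blue box picks up 6 of the 16 rows and the complement the remaining 10.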


Answered By - Ronak Shah
Answer Checked By - Katrina (PHPFixing Volunteer)

Monday, November 7, 2022

[FIXED] How Set.contains() decides whether it's a subset or not?

 November 07, 2022     hashset, iterator, java, set, subset

Issue

I expected the following code to give me a subset and its complementary set.

But instead, the result is "Error: This is not a subset!"

What does it.next() return, and how should I revise my code to get the result I want? Thanks!

package Chapter8;

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class Three {
    int n;
    Set<Integer> set = new HashSet<Integer>();

    public static void main(String args[]) {
        Three three = new Three(10);
        three.display(three.set);
        Set<Integer> test = new HashSet<Integer>();
        Iterator<Integer> it = three.set.iterator();
        while(it.hasNext()) {
            test.add(it.next());
            three.display(test);
            three.display(three.complementarySet(test));
        }

    }

    boolean contains(Set<Integer> s) {
        if (this.set.contains(s))
            return true;
        else 
            return false;
    }

    Set<Integer> complementarySet(Set<Integer> s) {
        if(this.set.contains(s)){
            Set<Integer> result = this.set;
            result.removeAll(s);
            return result;
        }
        else {
            System.out.println("Error: This is not a subset!");
            return null;
        }
    }

    Three() {
        this.n = 3;
        this.randomSet();
    }

    Three(int n) {
        this.n = n;
        this.randomSet();
    }

    void randomSet() {
        while(set.size() < n) {
            set.add((int)(Math.random()*10));
        }
    }

    void display(Set<Integer> s) {
        System.out.println("The set is " + s.toString());
    }
}

Solution

Your problem is in this part:

set.contains(s)

That doesn't do what you think it does: it doesn't take another Set as an argument and check whether that set's members are all contained in the first set. It checks whether the argument itself is an element of the Set — here, whether the Set contains a Set<Integer> as a member, which it never does.

You need to iterate over the "contained" set and call set.contains(element) for each of its elements, or use the method built for exactly this purpose: set.containsAll(s).

Also note that Set<Integer> result = this.set does not copy the set; result.removeAll(s) would then mutate the original. Build the complement from a copy, e.g. new HashSet<>(this.set).
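The distinction between element membership and subset testing is easy to see in a language with native set operators; here is a small Python sketch of the same logic (an illustration, not the Java fix itself):

```python
# Java's set.contains(x) asks "is x itself an element?", while
# set.containsAll(s) asks "is every element of s an element?".
# Python separates these as `in` vs. the subset operator `<=`.
full = {1, 2, 3, 4, 5}
sub = {2, 4}

# Membership test: is the object 2 an element? (Java: full.contains(2))
assert 2 in full
# Subset test: is every element of sub in full? (Java: full.containsAll(sub))
assert sub <= full

# Complement built via the difference operator, so `full` is not mutated
complement = full - sub
```

The complement here is {1, 3, 5}, and `full` is left intact afterwards.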



Answered By - morgano
Answer Checked By - Willingham (PHPFixing Volunteer)

Wednesday, November 2, 2022

[FIXED] How to increase speed on finding an index?

 November 02, 2022     indexing, pandas, python, subset

Issue

I have two dataframes: df2

Time(s) vehicle_id
10          1
11          1
12          1
13          1
...         ...
78          2
79          2
80          2

And com:

index    v1    v2
  1       1    2

I would like to find the indexes of the rows that match the following condition:

if (abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'time(s)']))>60.0):
    arr.append(i)

In this case the condition is true, since the max time for v1 is 13 and for v2 is 80, and |13 - 80| = 67 > 60. It works, but it has to be evaluated against more than 40M rows and progresses at about 200k rows/hour, so it will take many hours. Is there a faster way to proceed? I just need to obtain the indexes from com which match the above condition.

Update: I currently check the condition at every position using a for loop:

for i in range(0, len(com)):
    if (abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'time(s)'])) > 60.0):
        arr.append(i)

Solution

It looks like you only need the max time for each id, so sort by time and drop duplicates to keep only the max time per vehicle_id. Then replace the vehicle ids in your com frame with those times, apply the filter condition to the entire result at once, and save the resulting indexes to an array.

import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80], 'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
comm = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]})

m = df2.sort_values(by='Time(s)').drop_duplicates(subset='vehicle_id', keep='last')
comm = comm.set_index('index').replace(m.set_index('vehicle_id')['Time(s)'].to_dict())
arr = comm.loc[(comm['v1']-comm['v2']).abs().gt(60)].index.values

print(arr)
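An equivalent vectorised route (a sketch along the same lines, not from the original answer) is to precompute each vehicle's maximum time with groupby and map it onto both columns:

```python
import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80],
                    'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
com = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]})

# One max per vehicle_id, computed once over the big frame
max_t = df2.groupby('vehicle_id')['Time(s)'].max()

# Map those maxima onto v1/v2 and filter the whole frame in one pass
diff = com['v1'].map(max_t) - com['v2'].map(max_t)
arr = com.loc[diff.abs() > 60.0, 'index'].values
```

For this sample, max_t is {1: 13, 2: 80}, |13 - 80| = 67 > 60, so arr contains the single index 1.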


Answered By - Chris
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Saturday, October 29, 2022

[FIXED] How to clean double lines from joint statement in R?

 October 29, 2022     left-join, r, subset

Issue

I have done a join operation to compare a set of addresses with itself.

library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)
doTheJoin <- function(threshold) {
  joined <- trimData(d_client_squashed) %>% 
    stringdist_left_join(
      trimData(d_client_squashed), 
      by = c(address_full = "address_full"),
      distance_col = "distance",
      max_dist = threshold,
      method = "jw"
    )
}

The structure of d_client_squashed is the following, and it contains string values:

Client_Reference  address_full
C01               Client1 Name, Street, Zipcode, Town
C02               Client2 Name, Street2, Zipcode2, Town2
...               ...

The following operation:

sensible_matches <- doTheJoin(0.2)
View(sensible_matches %>% filter(Client_Reference.x != Client_Reference.y))

Results in the following output:

Client_Reference.x  address_full.x                          Client_Reference.y  address_full.y                          distance
C01                 Client1 Name, Street, Zipcode, Town     C02                 Client2 Name, Street2, Zipcode2, Town2  0.05486
C02                 Client2 Name, Street2, Zipcode2, Town2  C01                 Client1 Name, Street, Zipcode, Town     0.05486
...                 ...                                     ...                 ...                                     ...

The output of this join operation contains each match twice, with the client information reversed. The distance value is not unique, so it cannot be used to deduplicate. How can I subset the data frame to remove those duplicated lines?


Solution

In order to remove the rows containing the same data, you can sort the elements within each row, so that rows containing the same pair of Client_Reference values become identical, and then delete the duplicates. After that, you can filter out the self-matches (rows with the same Client_Reference on both sides), as you did.

sensible_matches <- sensible_matches[!duplicated(t(apply(sensible_matches,1,sort))),]
View(sensible_matches  %>% filter(Client_Reference.x != Client_Reference.y))
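The underlying trick — canonicalise each row so that reversed pairs compare equal, then drop duplicates — looks like this in Python terms (a sketch with made-up pair data, not the R answer itself):

```python
# Each tuple is (client_x, client_y, distance); the first two rows are the
# same match recorded in both directions.
pairs = [("C01", "C02", 0.05486),
         ("C02", "C01", 0.05486),
         ("C03", "C04", 0.10000)]

seen, unique = set(), []
for x, y, dist in pairs:
    key = frozenset((x, y))   # unordered pair: (x, y) compares equal to (y, x)
    if key not in seen:
        seen.add(key)
        unique.append((x, y, dist))
```

The mirrored C02/C01 row is dropped, leaving one row per unordered pair.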


Answered By - Giulio Mattolin
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

Friday, October 7, 2022

[FIXED] How to subset a column for each dataframe in RMSE calculation in R?

 October 07, 2022     calculation, dataframe, r, statistics, subset

Issue

I have two dataframes Vobs and Vest. See the example below:

dput(head(Vobs,20))
structure(list(ID = c("LAM_1", "LAM_2", "LAM_3", "LAM_4", "LAM_5", 
"LAM_6", "LAM_7", "AUR_1", "AUR_2", "AUR_3", "AUR_4", "AUR_5", 
"AUR_6"), SOS = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 
26), EOS = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27)), row.names = c(NA, 
-13L), class = c("tbl_df", "tbl", "data.frame"))
dput(head(Vest,30))
structure(list(ID = c("LAM", "LAM", "LAM", "LAM", "LAM", "AUR", 
"AUR", "AUR", "AUR", "AUR", "AUR", "P0", "P01", "P01", "P02", 
"P1", "P2", "P3", "P4", "P13", "P14", "P15", "P17", "P18", "P19", 
"P20", "P22", "P23", "P24"), EVI_SOS = c(2, 6, 10, 14, NA, 20, 
24, 28, 32, 36, NA, 42, 42, NA, 48, 48, 52, 56, 60, 64, 68, NA, 
NA, 72, NA, 78, 82, 86, 90), EVI_EOS = c(3, 7, 11, 15, NA, 21, 
25, 29, 33, 37, NA, 43, 43, NA, 49, 49, 53, 57, 61, 65, 69, NA, 
NA, 73, NA, 79, 83, 87, 91), NDVI_SOS = c(4, 8, 12, 16, 18, 22, 
26, 30, 34, 38, 40, 44, 44, 46, 50, 50, 54, 58, 62, 66, 70, NA, 
NA, 74, 76, 80, 84, 88, 92), NDVI_EOS = c(5, 9, 13, 17, 19, 23, 
27, 31, 35, 39, 41, 45, 45, 47, 51, 51, 55, 59, 63, 67, 71, NA, 
NA, 75, 77, 81, 85, 89, 93)), row.names = c(NA, -29L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to compute the root mean square error (RMSE) between the two dataframes. As an example, I intend to do the RMSE between the SOS column of Vobs and the EVI_SOS column of Vest for the LAM ID (which exists in both dataframes). In other words, I want to subset the data for the ID of interest: in this example, the LAM rows of Vest, and LAM_3 to LAM_7 (that is, LAM_3, LAM_4, LAM_5, LAM_6, LAM_7) of Vobs.

I have been using this code:

sqrt(mean((Vobs$SOS - Vest$EVI_SOS)^2, na.rm=TRUE))

but this misses the ID subsetting for the two columns from the different dataframes. How can I add that subsetting to this code?

Any help will be much appreciated.


Solution

You could get the subsets of the relevant data as:

library(stringr)
diff <- subset(Vobs, ID %in% paste0("LAM_", 3:7))$SOS - 
  subset(Vest, str_detect(ID, "LAM"))$EVI_SOS
sqrt(mean(diff^2, na.rm=TRUE))
#> [1] 2.44949
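To see where the 2.44949 comes from, here is the same calculation spelled out in Python (the observed LAM_3–LAM_7 SOS values and the estimated LAM EVI_SOS values are taken from the data above; None stands in for NA):

```python
import math

obs = [6, 8, 10, 12, 14]     # Vobs SOS for LAM_3 .. LAM_7
est = [2, 6, 10, 14, None]   # Vest EVI_SOS for the LAM rows (NA -> None)

# Squared differences over the non-missing pairs (mirrors na.rm = TRUE)
sq = [(o - e) ** 2 for o, e in zip(obs, est) if e is not None]
rmse = math.sqrt(sum(sq) / len(sq))   # sqrt(24 / 4) = sqrt(6) ≈ 2.449
```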


Answered By - DaveArmstrong
Answer Checked By - Cary Denson (PHPFixing Admin)

Thursday, October 6, 2022

[FIXED] How to run a t-test on a subset of data

 October 06, 2022     r, statistics, subset, t-test

Issue

First, is this dataset in a tidy form for a t-test?

https://i.stack.imgur.com/tMK6R.png

Second, I'm trying to run a two-sample t-test to compare the means at time 3 of treatments a and b for 'outcome 1'. How would I go about doing this?

Sample data:

df <- structure(list(code = c(100, 100, 100, 101, 101, 101, 102, 102, 
      102, 103, 103, 103), treatment = c("a", "a", "a", "b", "b", "b", 
      "a", "a", "a", "b", "b", "b"), sex = c("f", "f", "f", "m", "m", 
      "m", "f", "f", "f", "f", "f", "f"), time = c(1, 2, 3, 1, 2, 3, 
      1, 2, 3, 1, 2, 3), `outcome 1` = c(21, 23, 33, 44, 45, 47, 22, 
      34, 22, 55, 45, 56), `outcome 2` = c(21, 32, 33, 33, 44, 45, 
      22, 57, 98, 65, 42, 42), `outcome 3` = c(62, 84, 63, 51, 45, 
      74, 85, 34, 96, 86, 45, 47)), .Names = c("code", "treatment", 
      "sex", "time", "outcome 1", "outcome 2", "outcome 3"), 
      class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L))

Solution

First you'll have to define the subsets you want tested, then you can run the t-test. You don't have to necessarily store the subsets in variables as I've done, but it makes the t-test output clearer.

Normally with t-test questions, I'd recommend the help provided by ?t.test, but since this involves more complex subsetting, I've included how to do that here:

var_a <- df$`outcome 1`[df$treatment=="a" & df$time==3]
var_b <- df$`outcome 1`[df$treatment=="b" & df$time==3]

t.test(var_a,var_b)

Output:

    Welch Two Sample t-test

data:  var_a and var_b
t = -3.3773, df = 1.9245, p-value = 0.08182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -55.754265   7.754265
sample estimates:
mean of x mean of y 
     27.5      51.5 
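The reported Welch statistic can be reproduced by hand; here is a small Python sketch of the same computation on the two subsets (an illustration of the formula, not the R call):

```python
import math

def welch_t(x, y):
    # Welch's two-sample t statistic: difference in means divided by the
    # standard error built from the two unpooled sample variances.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

# `outcome 1` at time 3 for treatments a and b, from the sample data
t = welch_t([33, 22], [47, 56])   # matches the t = -3.3773 reported above
```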


Answered By - www
Answer Checked By - Robin (PHPFixing Admin)

Tuesday, May 17, 2022

[FIXED] How to subset multiple columns from df including grep match

 May 17, 2022     dataframe, match, partial, r, subset

Issue

I have a very large data set that includes multiple columns with common portions of their names (e.g ctq_1, ctq_2, ctq_3 and also panas_1, panas_2, panas_3). I'd like to subset some of those columns (e.g. only those containing 'panas' in the column name) alongside certain other columns from the same data frame that have unique names (e.g. id, group).

I tried using a grep function inside the square brackets, which worked nicely:

panas <- bigdata[ , grep('panas', colnames(bigdata))]

but now I need to work out how to also include the other two columns that I need, which are id and group. I tried:

panas <- bigdata[ , c('id', 'group', grep('panas', colnames(bigdata)))]

but I get this error: Error: Can't find columns 114, 115, 116, 117, 118, … (and 15 more) in .data. Call rlang::last_error() to see a backtrace.

How can I achieve what I want to with the simplest code possible? I am an R newbie so avoiding fancy functions would be ideal!

Here is a reproducible example.


> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> newframe <- iris[ , grep('Petal', colnames(iris))] # This works

> newframe <- iris[ , c('Species', grep('Petal', colnames(iris)))] # This doesn't work

This time, the error is:

Error in [.data.frame(iris, , c("Species", grep("Petal", colnames(iris)))) : undefined columns selected


Solution

Assuming I understood what you would like to do, a possible solution that may not be useful and/or may be redundant:

my_selector <- function(df,partial_name,...){
  positional_names <- match(...,names(df))
  df[,c(positional_names,grep(partial_name,names(df)))]
}
my_selector(iris, partial_name = "Petal","Species")

A "simpler" option would be to use grep and the like to match the target names at once:

iris[grep("Spec.*|Peta.*", names(iris))]

Or even simpler, as suggested by @akrun , we can simply do:

iris[grep("(Spec|Peta).*", names(iris))]

For more columns, we could do something like:

my_selector(iris, partial_name = "Petal",c("Species","Sepal.Length"))
       Species Sepal.Length Petal.Length Petal.Width
1       setosa          5.1          1.4         0.2
2       setosa          4.9          1.4         0.2

Note however that in the above function, the columns are selected counter-intuitively in that the names supplied last are selected first.

Result for the first part(truncated):

  Species Petal.Length Petal.Width
1  setosa          1.4         0.2
2  setosa          1.4         0.2
3  setosa          1.3         0.2
4  setosa          1.5         0.2
5  setosa          1.4         0.2
6  setosa          1.7         0.4
7  setosa          1.4         0.3
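For comparison, pandas exposes this pattern directly through DataFrame.filter with a regex (a sketch using a hypothetical frame shaped like the question's bigdata):

```python
import pandas as pd

# Hypothetical stand-in for `bigdata`: two fixed columns plus families of
# prefixed columns such as panas_* and ctq_*
bigdata = pd.DataFrame({'id': [1, 2], 'group': ['a', 'b'],
                        'panas_1': [0.1, 0.2], 'panas_2': [0.3, 0.4],
                        'ctq_1': [5, 6]})

# Keep `id`, `group`, and every column whose name starts with 'panas'
panas = bigdata.filter(regex=r'^(id|group|panas)')
```

Column order follows the original frame, and ctq_1 is excluded because it matches none of the alternatives.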


Answered By - NelsonGon
Answer Checked By - Senaida (PHPFixing Volunteer)

Copyright © PHPFixing