PHPFixing
Showing posts with label subset. Show all posts

Tuesday, November 22, 2022

[FIXED] How select area/polygon (lat, long) inside data frame R

 November 22, 2022     dataframe, multiple-conditions, r, select, subset

Issue

I have one data frame with more than 90,000 locations (latitude and longitude) with dive depths, and I would like to do two things, please.

# I wrote this example so that anyone who wants to help can try it in their own R session.

library(data.table)

df = data.table(dive = c(10, 15, 20, 50, 70, 80, 90, 40, 100, 40, 40, 
                         50, 67, 45, 70, 30),
                lat = c(-23, -24, -25, -26, -27, -28, -29, -30, -32, 
                        -33, -34, -35, -36, -37, -38, -39),
                lon = c(-44, -43, -42, -41, -40, -39, -38, -35, -30, 
                        -28, -25, -23, -20, -19, -18, -15))

# all of these columns are numeric

First

So that you can understand better, here is my map; the circles show the dive depth at each location:

[image: map of dive depths plotted as circles at each location]

I have dive depths and I want to select one specific area (latitude from -36 to -27, longitude from -40 to -27).

And I want to select, within the data frame, the dive depths inside the area shown in blue:

[image: map with the selected area highlighted in blue]

Second

Now, I need to select the inverse area, in green:

[image: map with the complementary area highlighted in green]

WHAT I TRIED

I tried to do this to select the "blue" area:

df2 <- df[df$lat <= -36 & df$lat >= -27 & df$lon >= -27 & df$lon <= -40]
# this returned the data frame with no variables and all locations


# And I tried inserting a comma at the end:

df2 <- df[df$lat <= -36 & df$lat >= -27 & df$lon >= -27 & df$lon <= -40, ]
# this returned a data frame with no observations

Does anyone know how to do this? Thanks!


Solution

Note that your original conditions can never be satisfied: df$lat <= -36 & df$lat >= -27 asks for values that are simultaneously at most -36 and at least -27. With the bounds the right way around, you can use:

blue_area <- subset(df, lat >= -36 & lat <= -27 & lon <= -27 & lon >= -40)
green_area <- dplyr::anti_join(df, blue_area)

anti_join() returns the rows of df that have no match in blue_area, i.e. the green area.
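For readers coming from Python, the same box selection and its complement can be sketched with pandas (a translation of the answer above for illustration, not part of the original answer):

```python
import pandas as pd

# The question's data, rebuilt as a pandas DataFrame
df = pd.DataFrame({
    "dive": [10, 15, 20, 50, 70, 80, 90, 40, 100, 40, 40, 50, 67, 45, 70, 30],
    "lat":  [-23, -24, -25, -26, -27, -28, -29, -30, -32, -33, -34, -35, -36, -37, -38, -39],
    "lon":  [-44, -43, -42, -41, -40, -39, -38, -35, -30, -28, -25, -23, -20, -19, -18, -15],
})

# "Blue" box: lat in [-36, -27] and lon in [-40, -27], both bounds inclusive
blue_area = df[df["lat"].between(-36, -27) & df["lon"].between(-40, -27)]

# "Green" complement: every row not selected above (the anti-join)
green_area = df.drop(blue_area.index)
```

With this sample data the blue box picks up 6 of the 16 rows and the complement the remaining 10.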


Answered By - Ronak Shah
Answer Checked By - Katrina (PHPFixing Volunteer)

Monday, November 7, 2022

[FIXED] How Set.contains() decides whether it's a subset or not?

 November 07, 2022     hashset, iterator, java, set, subset

Issue

I expected the following code to give me a subset and its complementary set.

But instead, the result is "Error: This is not a subset!"

What does it.next() return, and how should I revise my code to get the result I want? Thanks!

package Chapter8;

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class Three {
    int n;
    Set<Integer> set = new HashSet<Integer>();

    public static void main(String args[]) {
        Three three = new Three(10);
        three.display(three.set);
        Set<Integer> test = new HashSet<Integer>();
        Iterator<Integer> it = three.set.iterator();
        while(it.hasNext()) {
            test.add(it.next());
            three.display(test);
            three.display(three.complementarySet(test));
        }

    }

    boolean contains(Set<Integer> s) {
        if (this.set.contains(s))
            return true;
        else 
            return false;
    }

    Set<Integer> complementarySet(Set<Integer> s) {
        if(this.set.contains(s)){
            Set<Integer> result = this.set;
            result.removeAll(s);
            return result;
        }
        else {
            System.out.println("Error: This is not a subset!");
            return null;
        }
    }

    Three() {
        this.n = 3;
        this.randomSet();
    }

    Three(int n) {
        this.n = n;
        this.randomSet();
    }

    void randomSet() {
        while(set.size() < n) {
            set.add((int)(Math.random()*10));
        }
    }

    void display(Set<Integer> s) {
        System.out.println("The set is " + s.toString());
    }
}

Solution

Your problem is in this part:

set.contains(s)

That doesn't do what you think it does: it doesn't take another Set as an argument and check whether that set's members are all contained in the first set. It checks whether the argument itself is an element of the Set — here, whether the Set contains a Set<Integer> as a member, which it never does.

You need to iterate over the "contained" set and call set.contains(element) for each of its elements, or use the method built for exactly this purpose: set.containsAll(s).

Also note that Set<Integer> result = this.set does not copy the set; result.removeAll(s) would then mutate the original. Build the complement from a copy, e.g. new HashSet<>(this.set).
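The distinction between element membership and subset testing is easy to see in a language with native set operators; here is a small Python sketch of the same logic (an illustration, not the Java fix itself):

```python
# Java's set.contains(x) asks "is x itself an element?", while
# set.containsAll(s) asks "is every element of s an element?".
# Python separates these as `in` vs. the subset operator `<=`.
full = {1, 2, 3, 4, 5}
sub = {2, 4}

# Membership test: is the object 2 an element? (Java: full.contains(2))
assert 2 in full
# Subset test: is every element of sub in full? (Java: full.containsAll(sub))
assert sub <= full

# Complement built via the difference operator, so `full` is not mutated
complement = full - sub
```

The complement here is {1, 3, 5}, and `full` is left intact afterwards.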



Answered By - morgano
Answer Checked By - Willingham (PHPFixing Volunteer)

Wednesday, November 2, 2022

[FIXED] How to increase speed on finding an index?

 November 02, 2022     indexing, pandas, python, subset

Issue

I have two dataframes: df2

Time(s) vehicle_id
10          1
11          1
12          1
13          1
...         ...
78          2
79          2
80          2

And com:

index    v1    v2
  1       1    2

I would like to find the indexes of the rows that match the following condition:

if (abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'time(s)']))>60.0):
    arr.append(i)

In this case the condition is true, since the max time for v1 is 13 and for v2 is 80, and |13 - 80| = 67 > 60. It works, but it has to be evaluated against more than 40M rows and progresses at about 200k rows/hour, so it will take many hours. Is there a faster way to proceed? I just need to obtain the indexes from com which match the above condition.

Update: I currently check the condition at every position using a for loop:

for i in range(0, len(com)):
    if (abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'time(s)'])) > 60.0):
        arr.append(i)

Solution

It looks like you only need the max time for each id, so sort by time and drop duplicates to keep only the max time per vehicle_id. Then replace the vehicle ids in your com frame with those times, apply the filter condition to the entire result at once, and save the resulting indexes to an array.

import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80], 'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
comm = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]})

m = df2.sort_values(by='Time(s)').drop_duplicates(subset='vehicle_id', keep='last')
comm = comm.set_index('index').replace(m.set_index('vehicle_id')['Time(s)'].to_dict())
arr = comm.loc[(comm['v1']-comm['v2']).abs().gt(60)].index.values

print(arr)
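An equivalent vectorised route (a sketch along the same lines, not from the original answer) is to precompute each vehicle's maximum time with groupby and map it onto both columns:

```python
import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80],
                    'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
com = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]})

# One max per vehicle_id, computed once over the big frame
max_t = df2.groupby('vehicle_id')['Time(s)'].max()

# Map those maxima onto v1/v2 and filter the whole frame in one pass
diff = com['v1'].map(max_t) - com['v2'].map(max_t)
arr = com.loc[diff.abs() > 60.0, 'index'].values
```

For this sample, max_t is {1: 13, 2: 80}, |13 - 80| = 67 > 60, so arr contains the single index 1.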


Answered By - Chris
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Saturday, October 29, 2022

[FIXED] How to clean double lines from joint statement in R?

 October 29, 2022     left-join, r, subset

Issue

I have done a join operation to compare a set of addresses with itself.

library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)
doTheJoin <- function(threshold) {
  joined <- trimData(d_client_squashed) %>% 
    stringdist_left_join(
      trimData(d_client_squashed), 
      by = c(address_full = "address_full"),
      distance_col = "distance",
      max_dist = threshold,
      method = "jw"
    )
}

The structure of d_client_squashed is the following, and it contains string values:

Client_Reference  address_full
C01               Client1 Name, Street, Zipcode, Town
C02               Client2 Name, Street2, Zipcode2, Town2
...               ...

The following operation:

sensible_matches <- doTheJoin(0.2)
View(sensible_matches %>% filter(Client_Reference.x != Client_Reference.y))

Results in the following output:

Client_Reference.x  address_full.x                          Client_Reference.y  address_full.y                          distance
C01                 Client1 Name, Street, Zipcode, Town     C02                 Client2 Name, Street2, Zipcode2, Town2  0.05486
C02                 Client2 Name, Street2, Zipcode2, Town2  C01                 Client1 Name, Street, Zipcode, Town     0.05486
...                 ...                                     ...                 ...                                     ...

The output of this join operation contains each match twice, with the client information reversed. The distance value is not unique, so it cannot be used to deduplicate. How can I subset the data frame to remove those duplicated lines?


Solution

In order to remove the rows containing the same data, you can sort the elements within each row, so that rows containing the same pair of Client_Reference values become identical, and then delete the duplicates. After that, you can filter out the self-matches (rows with the same Client_Reference on both sides), as you did.

sensible_matches <- sensible_matches[!duplicated(t(apply(sensible_matches,1,sort))),]
View(sensible_matches  %>% filter(Client_Reference.x != Client_Reference.y))
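The underlying trick — canonicalise each row so that reversed pairs compare equal, then drop duplicates — looks like this in Python terms (a sketch with made-up pair data, not the R answer itself):

```python
# Each tuple is (client_x, client_y, distance); the first two rows are the
# same match recorded in both directions.
pairs = [("C01", "C02", 0.05486),
         ("C02", "C01", 0.05486),
         ("C03", "C04", 0.10000)]

seen, unique = set(), []
for x, y, dist in pairs:
    key = frozenset((x, y))   # unordered pair: (x, y) compares equal to (y, x)
    if key not in seen:
        seen.add(key)
        unique.append((x, y, dist))
```

The mirrored C02/C01 row is dropped, leaving one row per unordered pair.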


Answered By - Giulio Mattolin
Answer Checked By - Gilberto Lyons (PHPFixing Admin)

Friday, October 7, 2022

[FIXED] How to subset a column for each dataframe in RMSE calculation in R?

 October 07, 2022     calculation, dataframe, r, statistics, subset

Issue

I have two dataframes Vobs and Vest. See the example below:

dput(head(Vobs,20))
structure(list(ID = c("LAM_1", "LAM_2", "LAM_3", "LAM_4", "LAM_5", 
"LAM_6", "LAM_7", "AUR_1", "AUR_2", "AUR_3", "AUR_4", "AUR_5", 
"AUR_6"), SOS = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 
26), EOS = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27)), row.names = c(NA, 
-13L), class = c("tbl_df", "tbl", "data.frame"))
dput(head(Vest,30))
structure(list(ID = c("LAM", "LAM", "LAM", "LAM", "LAM", "AUR", 
"AUR", "AUR", "AUR", "AUR", "AUR", "P0", "P01", "P01", "P02", 
"P1", "P2", "P3", "P4", "P13", "P14", "P15", "P17", "P18", "P19", 
"P20", "P22", "P23", "P24"), EVI_SOS = c(2, 6, 10, 14, NA, 20, 
24, 28, 32, 36, NA, 42, 42, NA, 48, 48, 52, 56, 60, 64, 68, NA, 
NA, 72, NA, 78, 82, 86, 90), EVI_EOS = c(3, 7, 11, 15, NA, 21, 
25, 29, 33, 37, NA, 43, 43, NA, 49, 49, 53, 57, 61, 65, 69, NA, 
NA, 73, NA, 79, 83, 87, 91), NDVI_SOS = c(4, 8, 12, 16, 18, 22, 
26, 30, 34, 38, 40, 44, 44, 46, 50, 50, 54, 58, 62, 66, 70, NA, 
NA, 74, 76, 80, 84, 88, 92), NDVI_EOS = c(5, 9, 13, 17, 19, 23, 
27, 31, 35, 39, 41, 45, 45, 47, 51, 51, 55, 59, 63, 67, 71, NA, 
NA, 75, 77, 81, 85, 89, 93)), row.names = c(NA, -29L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to compute the root mean square error (RMSE) between the two dataframes. As an example, I intend to do the RMSE between the SOS column of Vobs and the EVI_SOS column of Vest for the LAM ID (which exists in both dataframes). In other words, I want to subset the data for the ID of interest: in this example, the LAM rows of Vest, and LAM_3 to LAM_7 (that is, LAM_3, LAM_4, LAM_5, LAM_6, LAM_7) of Vobs.

I have been using this code:

sqrt(mean((Vobs$SOS - Vest$EVI_SOS)^2, na.rm=TRUE))

but this misses the ID subsetting for the two columns from the different dataframes. How can I add that subsetting to this code?

Any help will be much appreciated.


Solution

You could get the subsets of the relevant data as:

library(stringr)
diff <- subset(Vobs, ID %in% paste0("LAM_", 3:7))$SOS - 
  subset(Vest, str_detect(ID, "LAM"))$EVI_SOS
sqrt(mean(diff^2, na.rm=TRUE))
#> [1] 2.44949
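To see where the 2.44949 comes from, here is the same calculation spelled out in Python (the observed LAM_3–LAM_7 SOS values and the estimated LAM EVI_SOS values are taken from the data above; None stands in for NA):

```python
import math

obs = [6, 8, 10, 12, 14]     # Vobs SOS for LAM_3 .. LAM_7
est = [2, 6, 10, 14, None]   # Vest EVI_SOS for the LAM rows (NA -> None)

# Squared differences over the non-missing pairs (mirrors na.rm = TRUE)
sq = [(o - e) ** 2 for o, e in zip(obs, est) if e is not None]
rmse = math.sqrt(sum(sq) / len(sq))   # sqrt(24 / 4) = sqrt(6) ≈ 2.449
```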


Answered By - DaveArmstrong
Answer Checked By - Cary Denson (PHPFixing Admin)

Thursday, October 6, 2022

[FIXED] How to run a t-test on a subset of data

 October 06, 2022     r, statistics, subset, t-test

Issue

First, is this dataset in a tidy form for a t-test?

https://i.stack.imgur.com/tMK6R.png

Second, I'm trying to run a two-sample t-test to compare the means at time 3 of treatments a and b for 'outcome 1'. How would I go about doing this?

Sample data:

df <- structure(list(code = c(100, 100, 100, 101, 101, 101, 102, 102, 
      102, 103, 103, 103), treatment = c("a", "a", "a", "b", "b", "b", 
      "a", "a", "a", "b", "b", "b"), sex = c("f", "f", "f", "m", "m", 
      "m", "f", "f", "f", "f", "f", "f"), time = c(1, 2, 3, 1, 2, 3, 
      1, 2, 3, 1, 2, 3), `outcome 1` = c(21, 23, 33, 44, 45, 47, 22, 
      34, 22, 55, 45, 56), `outcome 2` = c(21, 32, 33, 33, 44, 45, 
      22, 57, 98, 65, 42, 42), `outcome 3` = c(62, 84, 63, 51, 45, 
      74, 85, 34, 96, 86, 45, 47)), .Names = c("code", "treatment", 
      "sex", "time", "outcome 1", "outcome 2", "outcome 3"), 
      class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L))

Solution

First you'll have to define the subsets you want tested, then you can run the t-test. You don't have to necessarily store the subsets in variables as I've done, but it makes the t-test output clearer.

Normally with t-test questions, I'd recommend the help provided by ?t.test, but since this involves more complex subsetting, I've included how to do that here:

var_a <- df$`outcome 1`[df$treatment=="a" & df$time==3]
var_b <- df$`outcome 1`[df$treatment=="b" & df$time==3]

t.test(var_a,var_b)

Output:

    Welch Two Sample t-test

data:  var_a and var_b
t = -3.3773, df = 1.9245, p-value = 0.08182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -55.754265   7.754265
sample estimates:
mean of x mean of y 
     27.5      51.5 
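The reported Welch statistic can be reproduced by hand; here is a small Python sketch of the same computation on the two subsets (an illustration of the formula, not the R call):

```python
import math

def welch_t(x, y):
    # Welch's two-sample t statistic: difference in means divided by the
    # standard error built from the two unpooled sample variances.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

# `outcome 1` at time 3 for treatments a and b, from the sample data
t = welch_t([33, 22], [47, 56])   # matches the t = -3.3773 reported above
```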


Answered By - www
Answer Checked By - Robin (PHPFixing Admin)

Tuesday, May 17, 2022

[FIXED] How to subset multiple columns from df including grep match

 May 17, 2022     dataframe, match, partial, r, subset

Issue

I have a very large data set that includes multiple columns with common portions of their names (e.g ctq_1, ctq_2, ctq_3 and also panas_1, panas_2, panas_3). I'd like to subset some of those columns (e.g. only those containing 'panas' in the column name) alongside certain other columns from the same data frame that have unique names (e.g. id, group).

I tried using a grep function inside the square brackets, which worked nicely:

panas <- bigdata[ , grep('panas', colnames(bigdata))]

but now I need to work out how to also include the other two columns that I need, which are id and group. I tried:

panas <- bigdata[ , c('id', 'group', grep('panas', colnames(bigdata)))]

but I get this error: Error: Can't find columns 114, 115, 116, 117, 118, … (and 15 more) in .data. Call rlang::last_error() to see a backtrace.

How can I achieve what I want to with the simplest code possible? I am an R newbie so avoiding fancy functions would be ideal!

Here is a reproducible example.


> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> newframe <- iris[ , grep('Petal', colnames(iris))] # This works

> newframe <- iris[ , c('Species', grep('Petal', colnames(iris)))] # This doesn't work

This time, the error is:

Error in [.data.frame(iris, , c("Species", grep("Petal", colnames(iris)))) : undefined columns selected


Solution

Assuming I understood what you would like to do, a possible solution that may not be useful and/or may be redundant:

my_selector <- function(df,partial_name,...){
  positional_names <- match(...,names(df))
  df[,c(positional_names,grep(partial_name,names(df)))]
}
my_selector(iris, partial_name = "Petal","Species")

A "simpler" option would be to use grep and the like to match the target names at once:

iris[grep("Spec.*|Peta.*", names(iris))]

Or even simpler, as suggested by @akrun , we can simply do:

iris[grep("(Spec|Peta).*", names(iris))]

For more columns, we could do something like:

my_selector(iris, partial_name = "Petal",c("Species","Sepal.Length"))
       Species Sepal.Length Petal.Length Petal.Width
1       setosa          5.1          1.4         0.2
2       setosa          4.9          1.4         0.2

Note however that in the above function, the columns are selected counter-intuitively in that the names supplied last are selected first.

Result for the first part(truncated):

  Species Petal.Length Petal.Width
1  setosa          1.4         0.2
2  setosa          1.4         0.2
3  setosa          1.3         0.2
4  setosa          1.5         0.2
5  setosa          1.4         0.2
6  setosa          1.7         0.4
7  setosa          1.4         0.3
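For comparison, pandas exposes this pattern directly through DataFrame.filter with a regex (a sketch using a hypothetical frame shaped like the question's bigdata):

```python
import pandas as pd

# Hypothetical stand-in for `bigdata`: two fixed columns plus families of
# prefixed columns such as panas_* and ctq_*
bigdata = pd.DataFrame({'id': [1, 2], 'group': ['a', 'b'],
                        'panas_1': [0.1, 0.2], 'panas_2': [0.3, 0.4],
                        'ctq_1': [5, 6]})

# Keep `id`, `group`, and every column whose name starts with 'panas'
panas = bigdata.filter(regex=r'^(id|group|panas)')
```

Column order follows the original frame, and ctq_1 is excluded because it matches none of the alternatives.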


Answered By - NelsonGon
Answer Checked By - Senaida (PHPFixing Volunteer)

Copyright © PHPFixing