Using a regex to handle commas when cleaning data in R

Shared by user 月亮男神.

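In a previous question of mine (Creating adjacency matrix with dirty dataset), I was able to clean almost all of the data. Thank you, you wonderful programmers. However, while trying to understand how the playground works, I keep running into a problem with commas.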

The dataset originally looked like this:

Species    Association                  Year
<fctr>     <chr>                        <dbl>
1   RC     SKS/BW                       NA  
2   BW     Sykes, rc                    NA
3   SKS    Babo/bw                      NA
4   RC     baboon, mangabey             NA
5   Mang   red colobus, bw, sykes       NA
6   SKS    babo/red duiker              NA
11  BW     r/c monkeys                  12
21  RC     b/w colobus                  12
31  SKS    b/w colobus/R/c monkeys      12
41  BW     sykes/R/c monkeys            12
51  RC     sykes/b/w colobus            12
61  BABO   .                            12
7   SKS    .                            12
8   RC     .                            12
9   SKS    r/c monkeys                  12
10  RC     sykes monkeys                12
53  BW     sykes,b/w colobus            12
57  BW     r/c monkeys,bw               12
58  Mang   sykes,R/c monkeys            12

dput output:

dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS", 
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey", 
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")

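To clean it up, I created a dictionary and then used a regex to capture all the variations in the Association column, except for the last three rows, because they are separated by ',' rather than '/'.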
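As a rough illustration of the split-then-look-up idea described above, here is a minimal base-R sketch. `toy_dict` is a hypothetical four-row subset of the dictionary and `split_regex` is a name made up for this example. The lookbehind `(?<=\\w{2})` only lets a `/` act as a separator when two word characters precede it, so abbreviations such as `b/w` stay in one piece.

```r
# Hypothetical mini-dictionary, for illustration only
toy_dict <- data.frame(
  from = c("SKS", "BW", "B/W COLOBUS", "SYKES"),
  to   = c("SKS", "BW", "BW", "SKS"),
  stringsAsFactors = FALSE
)

# Split on "/" only when two word characters precede it, or on ", "
split_regex <- "(?<=\\w{2})/|,\\s"

parts <- strsplit(toupper(c("SKS/BW", "sykes/b/w colobus")),
                  split_regex, perl = TRUE)

# Look each piece up in the dictionary; unknown pieces would become NA
codes <- lapply(parts, function(p) toy_dict$to[match(p, toy_dict$from)])
cleaned <- sapply(codes, toString)
print(cleaned)
```

Both inputs should come out as `"SKS, BW"`: the slash in `b/w` has only one word character before it and is therefore left alone.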

dict <- read.table(header=TRUE, text='
from  to
"BABO"  BABO
"yellow baboon"  BABO
"BW"  BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC"  RC
"RED COLOBUS"  RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY"  MANG
"MANGA" MANG
"mangabeys" MANG
"SKS"  SKS
"SYKES"  SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER"  RD
"Red duiker" RD
"Red Duiker + V . Fresh dung" RD
')

regex <- '(?<=\\w{2})/|,\\s'

spf <- "%s"

data.frame(from=
             sprintf(spf, 
                     sort(unique(unlist(
                       strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |> 
                       print(row.names=FALSE)

res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
  lapply(\(x) dict[match(x, dict$from), ]$to) |>
  sapply(toString) |>
  {\(.) replace(., . == ".", NA)}() |>
  data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
  setNames(c('association', 'site', 'species', 'year')) |>
  subset(select=c(3, 1, 2, 4))

This gives me the final data frame:

Species    Association       Site           Year
<fctr>     <chr>             <chr>          <dbl>
1   RC     SKS, BW           Protected      NA
2   BW     SKS, RC           Protected      NA
3   SKS    BABO, BW          Protected      NA
4   RC     BABO, MANG        Protected      NA
5   MANG   RC, BW, SKS       Protected      NA
6   SKS    BABO, RD          Protected      NA
7   BW     RC                Protected      12
8   RC     BW                Protected      12
9   SKS    BW, RC            Protected      12
10  BW     SKS, RC           Protected      12
11  RC     SKS, BW           Protected      12
12  BABO   NA                Protected      12
13  SKS    NA                Protected      12
14  RC     NA                Protected      12
15  SKS    RC                Protected      12
16  RC     SKS               Protected      12
17  BW     NA                Protected      12
18  BW     NA                Protected      12
19  MANG   NA                Protected      12
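I would like the last three rows to show the correct associations as well (i.e. SKS, BW; RC, BW; SKS, RC), but everything I have read about regex uses the comma as part of the expression syntax rather than as a character found in the string. Is there a way to include it so that I get the correct output? I am still new to regex and to R. Any help is much appreciated.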
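One possible adjustment, sketched here as an assumption rather than taken from the original post: outside a `{m,n}` quantifier a comma is an ordinary literal in a regular expression, so the likely culprit is that `,\\s` demands a whitespace character after the comma, which the last three rows do not have. Making the whitespace optional with `,\\s*` lets those rows split too. The names `last_rows`, `strict`, and `relaxed` are made up for this example.

```r
# The three problem rows, upper-cased as in the original pipeline
last_rows <- toupper(c("sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"))

strict  <- "(?<=\\w{2})/|,\\s"   # comma must be followed by whitespace
relaxed <- "(?<=\\w{2})/|,\\s*"  # whitespace after the comma is optional

strict_parts  <- strsplit(last_rows, strict,  perl = TRUE)
relaxed_parts <- strsplit(last_rows, relaxed, perl = TRUE)

print(strict_parts)   # each row stays in one piece
print(relaxed_parts)  # each row is split at the comma
```

With the relaxed pattern each piece (`SYKES`, `B/W COLOBUS`, `R/C MONKEYS`, `BW`) has an upper-case entry in the dictionary, so the lookup step should no longer return NA for these rows.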

Recommended answer

The problem is in your dictionary. Using the tidyverse, you can do the following:

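library(tidyverse)

# Extend the dictionary: add the missing "BABOON" spelling, map "." to NA,
# and turn "/" separators into commas
dict1 <- dict %>%
  add_row(from = 'BABOON', to = 'BABO') %>%
  add_row(from = '.', to = NA) %>%
  add_row(from = '/', to = ',') %>%
  mutate(from = toupper(from)) %>%
  distinct() %>%
  arrange(desc(nchar(from)))  # longest keys first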

dat %>%
  mutate(Association = str_replace_all(toupper(Association), 
                              fixed(setNames(dict1$to, dict1$from))),
         Site = 'Protected')


  Species Association year      Site
1       RC      SKS,BW   NA Protected
2       BW     SKS, RC   NA Protected
3      SKS     BABO,BW   NA Protected
4       RC  BABO, MANG   NA Protected
5     Mang RC, BW, SKS   NA Protected
6      SKS     BABO,RD   NA Protected
11      BW          RC   12 Protected
21      RC          BW   12 Protected
31     SKS       BW,RC   12 Protected
41      BW      SKS,RC   12 Protected
51      RC      SKS,BW   12 Protected
61    BABO        <NA>   12 Protected
7      SKS        <NA>   12 Protected
8       RC        <NA>   12 Protected
9      SKS          RC   12 Protected
10      RC SKS MONKEYS   12 Protected
53      BW      SKS,BW   12 Protected
57      BW       RC,BW   12 Protected
58    Mang      SKS,RC   12 Protected
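
The `arrange(desc(nchar(from)))` step presumably exists because the replacements are applied one after another, so a short key can consume part of a longer key before the longer one gets a chance to match. The base-R sketch below illustrates that ordering effect only. It uses a hypothetical two-entry dictionary and a made-up helper, `replace_in_order`, and is not the code from the answer.

```r
# Hypothetical two-entry dictionary, for illustration only
from <- c("SYKES", "SYKES MONKEYS")
to   <- c("SKS", "SKS")

# Apply literal (fixed) replacements one after another, in the given order
replace_in_order <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Short key first: "SYKES" is replaced, leaving " MONKEYS" behind
short_first <- replace_in_order("SYKES MONKEYS", from, to)

# Longest key first: the whole phrase is replaced in one step
ord <- order(nchar(from), decreasing = TRUE)
long_first <- replace_in_order("SYKES MONKEYS", from[ord], to[ord])

print(c(short_first = short_first, long_first = long_first))
```

Sorting the keys by decreasing length is a common way to keep partial matches from shadowing the full phrase.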