R - How to effectively flatten nested lists and dataframes into a single dataframe?
I have data formatted in a way that's difficult to use, and I'm trying to flatten it out. A minimal reproducible example is here.
    > str(sampledata)
    List of 4
     $ events       :'data.frame': 2 obs. of 3 variables:
      ..$ cateringoptions:List of 2
      .. ..$ :'data.frame': 1 obs. of 3 variables:
      .. .. ..$ agreed : logi TRUE
      .. .. ..$ tnc    :'data.frame': 1 obs. of 5 variables:
      .. .. .. ..$ identity      : chr "spicyowing"
      .. .. .. ..$ schema        : logi NA
      .. .. .. ..$ elementid     : chr "105031"
      .. .. .. ..$ elementtype   : logi NA
      .. .. .. ..$ elementversion: logi NA
      .. .. ..$ address: chr "new york"
      .. ..$ :'data.frame': 1 obs. of 3 variables:
      .. .. ..$ agreed : logi TRUE
      .. .. ..$ tnc    :'data.frame': 1 obs. of 5 variables:
      .. .. .. ..$ identity      : chr "baconeggs"
      .. .. .. ..$ schema        : logi NA
      .. .. .. ..$ elementid     : chr "105032"
      .. .. .. ..$ elementtype   : logi NA
      .. .. .. ..$ elementversion: logi NA
      .. .. ..$ address: chr "seattle"
      ..$ action         : num [1:2] 1 1
      ..$ volume         : num [1:2] 1000 2000
     $ host         :List of 5
      ..$ identity      : chr "john"
      ..$ schema        : logi NA
      ..$ elementid     : chr "101505"
      ..$ elementtype   : logi NA
      ..$ elementversion: logi NA
     $ sender       :List of 5
      ..$ identity      : chr "jane"
      ..$ schema        : logi NA
      ..$ elementid     : chr "101005"
      ..$ elementtype   : logi NA
      ..$ elementversion: logi NA
     $ completeddate: chr "/date(1490112000000)/"
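For anyone who wants to run the code in this thread without the original data, the following is a sketch that rebuilds an object with the same shape as the str() output above. The helpers make_ref() and make_option() are hypothetical names introduced only for this illustration, and the values are copied from the str() output as displayed; the real object may differ in details that str() does not show.

```r
# Hypothetical helper: one five-field reference record, as a plain list.
make_ref <- function(identity, elementid) {
  list(identity = identity, schema = NA, elementid = elementid,
       elementtype = NA, elementversion = NA)
}

# Hypothetical helper: a 1-row data frame whose `tnc` column is itself
# a 1-row data frame (the nesting shown by str()).
make_option <- function(identity, elementid, address) {
  opt <- data.frame(agreed = TRUE)
  opt$tnc <- as.data.frame(make_ref(identity, elementid),
                           stringsAsFactors = FALSE)
  opt$address <- address
  opt
}

# `events` is a data frame with a list column holding two data frames.
events <- data.frame(action = c(1, 1), volume = c(1000, 2000))
events$cateringoptions <- list(
  make_option("spicyowing", "105031", "new york"),
  make_option("baconeggs", "105032", "seattle")
)
events <- events[c("cateringoptions", "action", "volume")]

sampledata <- list(
  events        = events,
  host          = make_ref("john", "101505"),
  sender        = make_ref("jane", "101005"),
  completeddate = "/date(1490112000000)/"
)

str(sampledata)
```

Assigning a data frame with `$<-` keeps it as a single nested column, which is what makes this structure awkward to flatten.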
Expected:
    > expectedoutcome
      events.cateringoptions.agreed events.cateringoptions.tnc.identity events.cateringoptions.tnc.schema events.cateringoptions.tnc.elementid
    1                            NA                          spicyowing                              TRUE                               105031
    2                            NA                           baconeggs                              TRUE                               105032
      events.cateringoptions.tnc.elementtype events.cateringoptions.tnc.elementversion events.cateringoptions.address events.action events.volume host.identity
    1                                     NA                                        NA                       new york             1          1000          john
    2                                     NA                                        NA                        seattle             1          2000          john
      host.schema host.elementid host.elementtype host.elementversion sender.identity sender.schema sender.elementid sender.elementtype sender.elementversion
    1          NA         101505               NA                  NA            jane            NA           101005                 NA                    NA
    2          NA         101505               NA                  NA            jane            NA           101005                 NA                    NA
              completeddate
    1 /date(1490112000000)/
    2 /date(1490112000000)/
The check function:
    check <- function(li) {
      # Flag which elements are data frames and which are plain lists
      aredf   <- sapply(1:length(li), function(i) class(li[[i]]) == "data.frame")
      arelist <- sapply(1:length(li), function(i) class(li[[i]]) == "list")
      tmp1 <- NULL
      tmp2 <- NULL
      if (any(aredf)) {
        # Flatten each data frame, recurse into its columns, then stack the results
        for (j in which(aredf)) {
          columns <- jsonlite::flatten(li[[j]])
          li[[j]] <- check(columns)
        }
        tmp1 <- plyr::rbind.fill(li[aredf])
        #return(tmp1)
      }
      if (any(arelist)) {
        # Recurse into each list, then bind every element of li side by side
        for (j in which(arelist)) {
          li[[j]] <- check(li[[j]])
        }
        tmp2 <- do.call(cbind, li)
        #return(tmp2)
      }
      if (!is.null(tmp1) && !is.null(tmp2)) {
        return(cbind(tmp1, tmp2))
      } else if (!is.null(tmp1)) {
        return(tmp1)
      } else if (!is.null(tmp2)) {
        return(tmp2)
      }
      return(li)
    }
Results:
    > str(check(sampledata))
    'data.frame': 2 obs. of 29 variables:
     $ cateringoptions.agreed                   : logi TRUE TRUE
     $ cateringoptions.address                  : chr "new york" "seattle"
     $ cateringoptions.tnc.identity             : chr "spicyowing" "baconeggs"
     $ cateringoptions.tnc.schema               : logi NA NA
     $ cateringoptions.tnc.elementid            : chr "105031" "105032"
     $ cateringoptions.tnc.elementtype          : logi NA NA
     $ cateringoptions.tnc.elementversion       : logi NA NA
     $ action                                   : num 1 1
     $ volume                                   : num 1000 2000
     $ events.cateringoptions.agreed            : logi TRUE TRUE
     $ events.cateringoptions.address           : chr "new york" "seattle"
     $ events.cateringoptions.tnc.identity      : chr "spicyowing" "baconeggs"
     $ events.cateringoptions.tnc.schema        : logi NA NA
     $ events.cateringoptions.tnc.elementid     : chr "105031" "105032"
     $ events.cateringoptions.tnc.elementtype   : logi NA NA
     $ events.cateringoptions.tnc.elementversion: logi NA NA
     $ events.action                            : num 1 1
     $ events.volume                            : num 1000 2000
     $ host.identity                            : Factor w/ 1 level "john": 1 1
     $ host.schema                              : logi NA NA
     $ host.elementid                           : Factor w/ 1 level "101505": 1 1
     $ host.elementtype                         : logi NA NA
     $ host.elementversion                      : logi NA NA
     $ sender.identity                          : Factor w/ 1 level "jane": 1 1
     $ sender.schema                            : logi NA NA
     $ sender.elementid                         : Factor w/ 1 level "101005": 1 1
     $ sender.elementtype                       : logi NA NA
     $ sender.elementversion                    : logi NA NA
     $ completeddate                            : Factor w/ 1 level "/date(1490112000000)/": 1 1
As I have it, the nested dataframe is being duplicated. Also, the code takes a long time to run. Does anyone have an idea how I can go about flattening this?
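One plausible reading of the duplication, judging from the 29-column result, is that the data frame elements end up in both branches of check(): once on their own via rbind.fill(), and once more inside do.call(cbind, li), which binds every element of the list. The sketch below reproduces that mechanism on a toy list, not on the real sampledata; the variable names mirror those in check(), and base R subsetting stands in for the plyr::rbind.fill() call.

```r
# Toy stand-in for the list handled by check(); not the real data.
li <- list(
  events        = data.frame(action = c(1, 1), volume = c(1000, 2000)),
  completeddate = "/date(1490112000000)/"
)

# First branch: the data frame elements on their own (the role of tmp1).
tmp1 <- li[["events"]]

# Second branch: cbind over the whole list, data frame included
# (the role of tmp2). Named data frame arguments get a name prefix.
tmp2 <- do.call(cbind, li)

# Binding both branches repeats the events columns,
# once without and once with the "events." prefix.
dup <- cbind(tmp1, tmp2)
names(dup)
```

If that reading is right, leaving the data frame elements out of the second cbind, or returning only one of the two branches, would be one direction to try, though that is untested against the real data.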
Edit:
I added my solution at the end, in a gist.
Here is a take at it, using purrr. The idea is similar to yours, with a different syntax: flatten() the nested dataframes, then rbind() them.
If I understand your code properly, mine differs at the end, since I try to build a more "jsonlite::flatten-friendly" structure and then apply jsonlite::flatten() once more to the end result:
    library(jsonlite)
    library(purrr)

    res <- sampledata %>%
      modify_if(
        is.list,
        .f = ~ modify_if(
          .x,
          .p = function(x) all(sapply(x, is.data.frame)),
          .f = ~ do.call("rbind", lapply(.x, jsonlite::flatten))
        )
      ) %>%
      as.data.frame() %>%
      jsonlite::flatten()

    str(res)
    # 'data.frame': 2 obs. of 20 variables:
    #  $ events.action                            : num 1 1
    #  $ events.volume                            : num 1000 2000
    #  $ host.identity                            : chr "john" "john"
    #  $ host.schema                              : logi NA NA
    #  $ host.elementid                           : chr "101505" "101505"
    #  $ host.elementtype                         : logi NA NA
    #  $ host.elementversion                      : logi NA NA
    #  $ sender.identity                          : chr "jane" "jane"
    #  $ sender.schema                            : logi NA NA
    #  $ sender.elementid                         : chr "101005" "101005"
    #  $ sender.elementtype                       : logi NA NA
    #  $ sender.elementversion                    : logi NA NA
    #  $ completeddate                            : chr "/date(1490112000000)/" "/date(1490112000000)/"
    #  $ events.cateringoptions.agreed            : logi TRUE TRUE
    #  $ events.cateringoptions.address           : chr "new york" "seattle"
    #  $ events.cateringoptions.tnc.identity      : chr "spicyowing" "baconeggs"
    #  $ events.cateringoptions.tnc.schema        : logi NA NA
    #  $ events.cateringoptions.tnc.elementid     : chr "105031" "105032"
    #  $ events.cateringoptions.tnc.elementtype   : logi NA NA
    #  $ events.cateringoptions.tnc.elementversion: logi NA NA
I've got one mismatch with expectedoutcome; if I may, it might be on your side:
    all.equal(expectedoutcome[sort(names(expectedoutcome))], res[sort(names(res))])
    # [1] "component “events.cateringoptions.agreed”: 'is.na' value mismatch: 0 in current 2 in target"
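The comparison above sorts both sets of column names first because all.equal() compares data frames position by position, so the same columns in a different order would otherwise be reported as differences. A minimal sketch on toy data (the data frames a and b are made up for illustration):

```r
a <- data.frame(x = 1:2, y = c("u", "v"), stringsAsFactors = FALSE)
b <- a[c("y", "x")]  # same data, columns in a different order

# Compared as-is, the differing column order counts as a mismatch.
same_unsorted <- isTRUE(all.equal(a, b))

# Reordering both by sorted column name makes the check order-independent.
same_sorted <- isTRUE(all.equal(a[sort(names(a))], b[sort(names(b))]))

same_unsorted  # FALSE
same_sorted    # TRUE
```

Wrapping the call in isTRUE() matters because all.equal() returns a character vector describing the differences rather than FALSE.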