sql - Losing rows on Google BigQuery after a WHERE / through unnesting/flattening


I have daily Google Analytics tables for a website. One table has 167,286 rows. My main goal is to create a new table (CSV) with only the needed columns and rows. Using legacy SQL I have this query:

SELECT
  CONCAT(CAST(visitid AS STRING), CAST(fullvisitorid AS STRING), CAST(visitnumber AS STRING), CAST(hits.hitnumber AS STRING)) AS identifier,
  hits.hitnumber AS hitnumber,
  hits.page.pagepath AS pagepath,
  hits.page.pagepathlevel1 AS pagepathlevel1,
  hits.page.pagepathlevel2 AS pagepathlevel2,
  hits.appinfo.exitscreenname AS exitscreenname,
  hits.eventinfo.eventcategory AS eventcategory,
  hits.eventinfo.eventaction AS eventaction,
  hits.eventinfo.eventlabel AS eventlabel,
  hits.customdimensions.value AS value,
  hits.customdimensions.index AS index,
  visitid,
  fullvisitorid,
  date,
  visitnumber,
  totals.hits AS hits,
  totals.pageviews AS pageviews,
  device.devicecategory AS devicecategory,
  geonetwork.city,
  channelgrouping,
  trafficsource.campaign AS campaign,
  trafficsource.source AS source,
  trafficsource.medium AS medium
FROM [project:dataset.table]
WHERE NOT hits.customdimensions.value = "privateuser"
  AND NOT hits.customdimensions.value = "loggedin"

At first I thought I was losing rows because of the WHERE statement, but rows that should not be affected by that statement are missing as well.

After running the query I got a total of 77,250 rows. Each of the excluded values accounts for 23,825 rows: I get 23,825 rows when I modify the statement to WHERE hits.customdimensions.value = "privateuser", and the same goes for "loggedin".

But 167,286 - 2 * 23,825 = 119,636, not 77,250. I am losing 42,386 rows and don't have a clue why. Does anyone have an idea why this is happening? I want to keep every row except those where the value is "privateuser" or "loggedin", and that should be more than 77,250 rows.

This problem appears in both legacy and standard SQL. The same query in standard SQL:

SELECT
  <columns, modified for the unnesting>
FROM `project.dataset.table`,
  UNNEST(hits) AS h,
  UNNEST(h.customdimensions) AS c
WHERE NOT c.value = "privateuser"
  AND NOT c.value = "loggedin"

I am losing 42,386 rows again and don't know why :(

I think I am getting closer to the cause of the problem: the nested schema.

If I run the standard SQL query without the WHERE statement, I get a total of 124,900 rows. The 42,386 rows are missing here again. So the cause seems to be the unnesting or flattening of the data.
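This is consistent with how the comma join behaves: in standard SQL, `, UNNEST(...)` (like FLATTEN in legacy SQL) acts as an INNER JOIN against the repeated field, so any hit whose customdimensions array is empty is dropped before the WHERE clause ever runs. A quick diagnostic sketch (the table name is a placeholder) to count the hits that vanish this way:

```sql
#standardSQL
-- Hits with no custom dimensions at all: these disappear
-- when you comma-join UNNEST(h.customdimensions).
SELECT COUNT(*) AS hits_without_dimensions
FROM `project.dataset.table` AS t,
  UNNEST(t.hits) AS h
WHERE ARRAY_LENGTH(h.customdimensions) = 0;
```

If this count matches the missing 42,386 rows, the inner-join semantics of the comma join are the explanation.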

Maybe the better question is: "How can I unnest or flatten the data without losing rows?"
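One way to unnest without dropping rows is LEFT JOIN with UNNEST, which keeps parent rows even when the repeated field is empty (a sketch; the table name is a placeholder and column names follow the GA export schema):

```sql
#standardSQL
SELECT
  t.visitid,
  h.hitnumber,
  c.value
FROM `project.dataset.table` AS t
LEFT JOIN UNNEST(t.hits) AS h
LEFT JOIN UNNEST(h.customdimensions) AS c;
```

Note that hits without custom dimensions come back with c.value as NULL, so a filter like NOT c.value = "privateuser" would still discard them (a comparison against NULL is not TRUE); filter with NOT EXISTS or add OR c.value IS NULL instead.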

If I click the "Hide Options" button, select an output table, and activate "Allow Large Results" and "Flatten Results", I get an error message like "Cannot flatten non-repeated field hits_1".

BigQuery's legacy SQL dialect can have confusing semantics in relation to COUNT(*), assuming that's what you are using. If you use standard SQL instead, you should get the results you expect. The query might look like:

#standardSQL
SELECT COUNT(*)
FROM `your-dataset.your-table`
WHERE NOT EXISTS (
  SELECT 1
  FROM UNNEST(customdimensions)
  WHERE value IN ('privateuser', 'loggedin')
);

Alternatively, if you are trying to count the number of customdimensions elements whose value is not one of those strings, you can do:

#standardSQL
SELECT
  SUM((SELECT COUNT(*)
       FROM UNNEST(customdimensions)
       WHERE value NOT IN ('privateuser', 'loggedin'))) AS value_count
FROM `your-dataset.your-table`;
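To produce the filtered export table itself rather than a count, the same NOT EXISTS pattern can be applied per hit, so that hits carrying none of the excluded values survive even when they have no custom dimensions at all (a sketch; the table name is a placeholder, a few representative columns are shown, and customdimensions is taken to be nested under hits as in the GA export schema):

```sql
#standardSQL
SELECT
  t.visitid,
  t.fullvisitorid,
  h.hitnumber,
  h.page.pagepath
FROM `project.dataset.table` AS t,
  UNNEST(t.hits) AS h
WHERE NOT EXISTS (
  SELECT 1
  FROM UNNEST(h.customdimensions)
  WHERE value IN ('privateuser', 'loggedin')
);
```

Because the subquery only tests for the existence of an excluded value, a hit with an empty customdimensions array passes the filter instead of being silently dropped.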

You can read more about the differences between legacy and standard SQL in the migration guide.

