hadoop - Generating star schema in Hive
I come from the SQL data warehouse world where, from a flat feed, I generate dimension and fact tables. In general data warehouse projects we divide the feed into fact and dimension tables.
I am new to Hadoop and came to know that one can build a data warehouse in Hive. Now, being familiar with GUIDs, I think they are applicable as primary keys in Hive. So, is the strategy below the right way to load fact and dimension tables in Hive?
- Load the source data into a Hive table; let's call it sales_data_warehouse
- Generate the dimensions from sales_data_warehouse; for example:
SELECT NEW_GUID(), customer_name, customer_address FROM sales_data_warehouse
- When the dimensions are done, load the fact table like:
SELECT NEW_GUID() AS fact_key, customer.customer_key, store.store_key... FROM sales_data_warehouse source JOIN customer_dimension customer ON source.customer_name = customer.customer_name AND source.customer_address = customer.customer_address JOIN store_dimension store ON store.store_name = source.store_name JOIN product_dimension product ON .....
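For reference, a minimal runnable HiveQL sketch of the pattern above. It assumes a Hive version that ships the built-in uuid() function (2.3+) as a stand-in for NEW_GUID(); the store_dimension table and the sales_amount measure are placeholders for illustration:

-- Dimension: de-duplicate the natural key columns and attach a generated surrogate key
CREATE TABLE customer_dimension AS
SELECT uuid() AS customer_key, customer_name, customer_address
FROM (
  SELECT DISTINCT customer_name, customer_address
  FROM sales_data_warehouse
) src;

-- Fact: join back to the dimensions on the natural keys to pick up the surrogate keys
CREATE TABLE sales_fact AS
SELECT uuid() AS fact_key,
       c.customer_key,
       s.store_key,
       src.sales_amount
FROM sales_data_warehouse src
JOIN customer_dimension c
  ON src.customer_name = c.customer_name
 AND src.customer_address = c.customer_address
JOIN store_dimension s
  ON src.store_name = s.store_name;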
Is this the way I should load my fact and dimension tables in Hive?
Also, in general warehouse projects we need to update dimension attributes (for example, a customer_address changes to something else) or have to update a fact table's foreign key (rarely, but it does happen). So, how can I have an insert-update load in Hive (like a lookup in SSIS or a MERGE statement in T-SQL)?
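For what it's worth, Hive 2.2+ does ship a MERGE statement for ACID transactional tables, which maps fairly closely to T-SQL MERGE. A minimal sketch, where customer_staging is a hypothetical staging table that already carries a generated customer_key:

-- The target must be a transactional (ACID) ORC table, e.g. created with
-- STORED AS ORC TBLPROPERTIES ('transactional'='true')
MERGE INTO customer_dimension AS dim
USING customer_staging AS stg
ON dim.customer_name = stg.customer_name
WHEN MATCHED THEN
  UPDATE SET customer_address = stg.customer_address
WHEN NOT MATCHED THEN
  INSERT VALUES (stg.customer_key, stg.customer_name, stg.customer_address);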
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop file system is immutable. We can only add data, not update it. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, it seems rather buggy). Slowly changing dimensions on Hadoop become the default behaviour. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
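A minimal sketch of the first option, assuming each appended dimension record carries a load_ts load timestamp:

-- Expose only the most recent version of each customer via a windowed view
CREATE VIEW customer_dimension_current AS
SELECT customer_key, customer_name, customer_address
FROM (
  SELECT customer_key, customer_name, customer_address,
         ROW_NUMBER() OVER (PARTITION BY customer_name ORDER BY load_ts DESC) AS rn
  FROM customer_dimension
) versions
WHERE rn = 1;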
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node of the cluster. This makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. There, tables are split into big chunks and distributed across the nodes of our cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins of two very large tables on Hadoop are quite expensive, because data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten the data out at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
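In Hive the rough equivalent of array_agg is collect_list (over structs it needs Hive 0.13+); a sketch of nesting a child transaction table into its parent and flattening it back out at query time, where the orders / order_lines tables and their columns are made up for illustration:

-- Nest child records (order lines) into the parent (orders) as an array of structs
CREATE TABLE orders_nested AS
SELECT o.order_id,
       o.customer_key,
       collect_list(named_struct('product_id', l.product_id,
                                 'quantity',  l.quantity,
                                 'amount',    l.amount)) AS lines
FROM orders o
JOIN order_lines l ON o.order_id = l.order_id
GROUP BY o.order_id, o.customer_key;

-- Flatten the nested records back out at run time
SELECT n.order_id, line.product_id, line.quantity, line.amount
FROM orders_nested n
LATERAL VIEW explode(n.lines) exploded AS line;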
I would question the usefulness of surrogate keys. Why not use the natural key? Maybe performance for complex compound keys could be an issue, but otherwise surrogate keys are not really useful and I never use them.