python - Optimal variable initialization and learning rate in TensorFlow for matrix factorization
I'm trying a simple optimization in TensorFlow: the problem of matrix factorization. Given a matrix V (m x n), decompose it into W (m x r) and H (r x n). I'm borrowing a gradient-descent-based TensorFlow implementation of matrix factorization from here.
Some details about the matrix V. In its original form, the histogram of its entries looks as follows:
To bring the entries onto the scale [0, 1], I perform the following preprocessing:
f(x) = (x - min(V)) / (max(V) - min(V))
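In code, this min-max scaling is simply the following (a numpy sketch; V_raw stands in for the real data matrix, which is not the toy example shown later):

import numpy as np

V_raw = np.array([[3., 4., 5., 2.],
                  [4., 4., 3., 3.],
                  [5., 5., 4., 4.]])   # stand-in for the real matrix V
V_scaled = (V_raw - V_raw.min()) / (V_raw.max() - V_raw.min())   # entries now in [0, 1]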
After normalization, the histogram of the data looks as follows:
My questions are:

- Given the nature of the data (entries between 0 and 1, most of them closer to 0 than to 1), what is the optimal initialisation for W and H?
- How should the learning rate be defined for the different cost functions |A - WH|_F and |(A - WH)/A|?
A minimal working example follows:
import tensorflow as tf
import numpy as np
import pandas as pd

V_df = pd.DataFrame([[3, 4, 5, 2],
                     [4, 4, 3, 3],
                     [5, 5, 4, 4]], dtype=np.float32).T
Thus, V_df looks like:
     0    1    2
0  3.0  4.0  5.0
1  4.0  4.0  5.0
2  5.0  3.0  4.0
3  2.0  3.0  4.0
Now, the code defining W and H:
V = tf.constant(V_df.values)
shape = V_df.shape
rank = 2  # latent factors

initializer = tf.random_normal_initializer(mean=V_df.mean().mean() / 5, stddev=0.1)
#initializer = tf.random_uniform_initializer(maxval=V_df.max().max())

H = tf.get_variable("H", [rank, shape[1]], initializer=initializer)
W = tf.get_variable(name="W", shape=[shape[0], rank], initializer=initializer)
WH = tf.matmul(W, H)
Defining the cost and the optimizer:
f_norm = tf.reduce_sum(tf.pow(V - WH, 2))
lr = 0.01
optimize = tf.train.AdagradOptimizer(lr).minimize(f_norm)
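For the second cost in my question, |(A - WH)/A|, what I have in mind is roughly the following sketch (reading it as the sum of element-wise absolute relative errors; the element-wise division is only safe here because V has no zero entries):

rel_norm = tf.reduce_sum(tf.abs(tf.div(V - WH, V)))  # element-wise relative error, summed
optimize_rel = tf.train.AdagradOptimizer(lr).minimize(rel_norm)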
Running the session:
max_iter = 10000
display_step = 50

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in xrange(max_iter):
        loss, _ = sess.run([f_norm, optimize])
        if i % display_step == 0:
            print loss,
    W_out = sess.run(W)
    H_out = sess.run(H)
    WH_out = sess.run(WH)
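To see how far the factorization ends up from the data, I compare the reconstruction with V after the session (a small sketch):

print(WH_out)
print(np.abs(V_df.values - WH_out).max())  # largest absolute reconstruction error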
I realized that when I used initializer = tf.random_uniform_initializer(maxval=V_df.max().max()), I got matrices W and H whose product was much greater than V. I also realized that keeping the learning rate (lr) at 0.0001 was too slow.
I was wondering if there are any rules of thumb for defining initializations and the learning rate for the problem of matrix factorization.
I think the choice of learning rate is an empirical issue of trial and error, unless you devise a second algorithm to find optimal values. It is also a practical concern, depending on how much time you have for the computation to finish, given the computing resources you have available.
However, one should be careful when setting initialization and learning rate values, since some values will never converge, depending on the machine learning problem. One rule of thumb is to manually change the magnitude in steps of 3 and not 10 (according to Andrew Ng): instead of moving from 0.1 to 1.0, go from 0.1 to 0.3.
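In code, that rule of thumb amounts to sweeping the learning rate over a roughly logarithmic grid; a sketch follows, where train_and_return_final_loss is a hypothetical helper that rebuilds the graph and trains it with the given rate:

candidate_lrs = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]  # steps of ~3, not 10
results = {}
for lr in candidate_lrs:
    # hypothetical helper: rebuild the graph, train as in the question, return the final loss
    results[lr] = train_and_return_final_loss(lr)
best_lr = min(results, key=results.get)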
For your specific data, which features multiple values near 0, it is possible to find optimal initialization values for a given specific "hypothesis"/model. However, you need to define "optimal". Should the method be as fast as possible, as accurate as possible, or some midpoint between these extremes? (Accuracy is not always a problem when seeking exact solutions. When it is, however, the choice of stopping rule and of the criteria for reducing errors can affect the outcome.)
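As a concrete example of such a stopping rule, you could stop when the relative improvement of the loss falls below a tolerance rather than always running max_iter steps; a sketch against the training loop in your question (tol is an assumed tolerance):

tol = 1e-6            # assumed tolerance on the relative improvement of the loss
prev_loss = np.inf
for i in xrange(max_iter):
    loss, _ = sess.run([f_norm, optimize])
    if prev_loss - loss < tol * prev_loss:
        break         # improvement too small: treat as converged
    prev_loss = loss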
Even if you find optimal parameters for this set of data, you might have problems using the same formula for other data sets. If you wish to use the same parameters for a different problem, you lose generalizability, unless you have strong reasons to expect other data sets to follow a similar distribution.
For the specific algorithm at hand, which uses stochastic gradient descent, there appear to be no simple answers*. The TensorFlow documentation refers to two sources:
- The AdaGrad algorithm (includes an evaluation of performance)
- An introduction to convex optimization

http://cs.stanford.edu/~ppasupat/a9online/uploads/proximal_notes.pdf
* "it clear choosing matrix b in update ... can substantially improve standard gradient method ... often, however, such choice not obvious, , in stochastic settings... highly non-obvious how choose matrix. moreover, in many stochastic settings, not know true function minimizing, since data arrives in stream, pre-computing distance-generating matrix impossible." duchi & singer, 2013, p. 5