python - `Optimal` variable initialization and learning rate in TensorFlow for matrix factorization

I'm trying a simple optimization in TensorFlow: the problem of matrix factorization. Given a matrix V (m x n), decompose it into W (m x r) and H (r x n). I'm borrowing a gradient-descent-based TensorFlow implementation of matrix factorization.

Some details about the matrix V: in its original form, the histogram of its entries is as follows:

[Figure: histogram of the original entries of V]

To bring the entries onto a scale of [0, 1], I perform the following preprocessing.

f(x) = (x - min(V)) / (max(V) - min(V))
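The preprocessing above is ordinary min-max scaling. A minimal sketch in plain Python follows, applied to the toy matrix used later in the question rather than to the real data; the function name `min_max_normalize` is made up for illustration.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a nested-list matrix to the range [0, 1]."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        raise ValueError("all entries are equal; min-max scaling is undefined")
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


# Toy matrix from the question (v_df after the transpose).
V = [[3.0, 4.0, 5.0],
     [4.0, 4.0, 5.0],
     [5.0, 3.0, 4.0],
     [2.0, 3.0, 4.0]]

normalized = min_max_normalize(V)
for row in normalized:
    print(row)
```

The smallest entry maps to exactly 0 and the largest to exactly 1, so the scaled matrix always contains at least one zero.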

After normalization, the histogram of the data is the following:

[Figure: histogram of the entries of V after normalization]

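My questions are:

Given the nature of the data (entries between 0 and 1, with most entries closer to 0 than to 1), what would be the optimal initialisation for W and H?
How should the learning rates be defined for the different cost functions: |A-WH|_F and |(A-WH)/A|? (Here A denotes the data matrix, called V above.)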
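For concreteness, the two costs can be written out as below. This is a minimal sketch in plain Python: the helper names are invented for illustration, the norm of the relative cost is assumed to be Frobenius (the question does not say), and the `eps` guard is an assumption added here because a min-max-scaled matrix contains at least one exact zero, where the ratio (A-WH)/A is otherwise undefined.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def frobenius_cost(a, w, h):
    """|A - WH|_F: square root of the sum of squared residuals."""
    wh = matmul(w, h)
    return math.sqrt(sum((a[i][j] - wh[i][j]) ** 2
                         for i in range(len(a)) for j in range(len(a[0]))))


def relative_cost(a, w, h, eps=1e-8):
    """Frobenius norm of the elementwise ratio (A - WH) / A.

    eps replaces |A_ij| where it is smaller than eps (assumption of this sketch).
    """
    wh = matmul(w, h)
    return math.sqrt(sum(((a[i][j] - wh[i][j]) / max(abs(a[i][j]), eps)) ** 2
                         for i in range(len(a)) for j in range(len(a[0]))))


A = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0], [2.0]]
H = [[1.0, 2.0]]          # WH = [[1, 2], [2, 4]], so only one entry is off by 1

abs_cost = frobenius_cost(A, W, H)
rel_cost = relative_cost(A, W, H)
zero_entry_cost = relative_cost([[0.0]], [[0.1]], [[1.0]])

print(abs_cost, rel_cost, zero_entry_cost)
```

The last value shows why the two costs may need very different learning rates: a small absolute error on a near-zero entry produces a huge relative error, and therefore a huge gradient. Note also that the code in the question minimizes the squared Frobenius norm (no square root), which has the same minimizer but differently scaled gradients.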

The minimal working example follows:

    import tensorflow as tf
    import numpy as np
    import pandas as pd

    # Toy data: three rows of four values, transposed to a 4 x 3 matrix.
    v_df = pd.DataFrame([[3, 4, 5, 2],
                         [4, 4, 3, 3],
                         [5, 5, 4, 4]], dtype=np.float32).T

Thus, v_df looks like:

         0    1    2
    0  3.0  4.0  5.0
    1  4.0  4.0  5.0
    2  5.0  3.0  4.0
    3  2.0  3.0  4.0

Now, the code defining the factors w and h:

    v = tf.constant(v_df.values)
    shape = v_df.shape
    rank = 2  # number of latent factors

    # Normal initializer centred at one fifth of the overall mean of V.
    initializer = tf.random_normal_initializer(mean=v_df.mean().mean() / 5, stddev=0.1)
    # initializer = tf.random_uniform_initializer(maxval=v_df.max().max())

    h = tf.get_variable("h", [rank, shape[1]], initializer=initializer)
    w = tf.get_variable(name="w", shape=[shape[0], rank], initializer=initializer)
    wh = tf.matmul(w, h)

Defining the cost and the optimizer:

    # Squared Frobenius norm of the residual V - WH.
    f_norm = tf.reduce_sum(tf.pow(v - wh, 2))
    lr = 0.01
    optimize = tf.train.AdagradOptimizer(lr).minimize(f_norm)

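Running the session:

    max_iter = 10000
    display_step = 50

    # Python 2 / TensorFlow 1.x syntax (xrange, print statement, tf.Session).
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in xrange(max_iter):
            # One optimizer step; also fetch the current loss.
            loss, _ = sess.run([f_norm, optimize])
            if i % display_step == 0:
                print loss,
        # Fetch the learned factors and their product.
        w_out = sess.run(w)
        h_out = sess.run(h)
        wh_out = sess.run(wh)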

I realized that when I used initializer = tf.random_uniform_initializer(maxval=v_df.max().max()), I got matrices W and H such that their product was greater than V. I also realised that keeping the learning rate (lr) at 0.0001 was slow.

I was wondering if there are any rules of thumb for defining the initializations and the learning rate for the problem of matrix factorization.

I think the choice of learning rate is an empirical issue of trial and error, unless you devise a second algorithm to find the optimal values. It is also a practical concern, depending on how much time you have for the computation to finish, given the computing resources you have available.

However, one should be careful when setting the initialization and learning rates, as some values will never converge, depending on the machine learning problem. One rule of thumb is to manually change the magnitude in steps of 3 and not 10 (according to Andrew Ng): instead of moving from 0.1 to 1.0, go from 0.1 to 0.3.
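The "steps of 3" rule can be turned into a small sweep. The sketch below is an illustration under several assumptions, not the setup from the question: it is plain Python rather than TensorFlow, it uses ordinary gradient descent instead of Adagrad, and the function names, step count, seed, and initial values around 1.0 are arbitrary choices.

```python
import math
import random

# Toy matrix from the question (v_df after the transpose).
V = [[3.0, 4.0, 5.0],
     [4.0, 4.0, 5.0],
     [5.0, 3.0, 4.0],
     [2.0, 3.0, 4.0]]
RANK = 2


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Sum of squared residuals, the same quantity as f_norm in the question."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) * (v[i][j] - wh[i][j])
               for i in range(len(v)) for j in range(len(v[0])))


def factorize(v, lr, steps=500, seed=0):
    """Plain gradient descent on the squared error; returns the loss history."""
    rng = random.Random(seed)  # same seed, so every rate starts from the same point
    m, n = len(v), len(v[0])
    w = [[rng.gauss(1.0, 0.1) for _ in range(RANK)] for _ in range(m)]
    h = [[rng.gauss(1.0, 0.1) for _ in range(n)] for _ in range(RANK)]
    losses = [squared_error(v, w, h)]
    for _ in range(steps):
        wh = matmul(w, h)
        resid = [[v[i][j] - wh[i][j] for j in range(n)] for i in range(m)]
        # Gradients are -2 * resid * h^T and -2 * w^T * resid, so descent adds them.
        step_w = matmul(resid, transpose(h))
        step_h = matmul(transpose(w), resid)
        w = [[w[i][k] + 2.0 * lr * step_w[i][k] for k in range(RANK)]
             for i in range(m)]
        h = [[h[k][j] + 2.0 * lr * step_h[k][j] for j in range(n)]
             for k in range(RANK)]
        losses.append(squared_error(v, w, h))
        if not math.isfinite(losses[-1]):
            break  # diverged; no point in continuing
    return losses


# Each candidate is roughly 3x the previous one.
ladder = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
initial_loss = factorize(V, ladder[0], steps=0)[0]
final_losses = {lr: factorize(V, lr)[-1] for lr in ladder}
converged = {lr: loss for lr, loss in final_losses.items() if math.isfinite(loss)}
best_lr = min(converged, key=converged.get)

for lr in ladder:
    print(lr, final_losses[lr])
print("best rate in this sweep:", best_lr)
```

Rates that blow up show as inf or nan in the printout and are excluded from the comparison. The winner of such a sweep depends on the data, the initialization, and the optimizer, so it is a starting point for further trial and error rather than a general answer.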

For your specific data, which features multiple values near 0, it is possible to find optimal initialization values given a specific "hypothesis"/model. However, you need to define "optimal". Should the method be as fast as possible, as accurate as possible, or some midpoint between these extremes? (Accuracy is not always a problem when seeking exact solutions. When it is, however, the choice of stopping rule and the criteria for reducing errors can affect the outcome.)
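One concrete, and deliberately narrow, notion of a good starting point is to match the mean of the initial product WH to the mean of V. Assuming the entries of W and H are drawn independently with a common mean mu, each entry of WH is a sum of r products, so its expected value is r * mu^2, which suggests mu = sqrt(mean(V) / r). The sketch below checks this in plain Python on the toy matrix from the question; the helper names are made up for illustration, and matching the mean is only one possible criterion, not a claim of optimality.

```python
import math
import random

# Toy matrix from the question (v_df after the transpose).
V = [[3.0, 4.0, 5.0],
     [4.0, 4.0, 5.0],
     [5.0, 3.0, 4.0],
     [2.0, 3.0, 4.0]]
RANK = 2


def expected_product_mean(entry_mean, rank):
    """E[(WH)_ij] when all entries of W and H are independent with mean entry_mean."""
    return rank * entry_mean * entry_mean


def matched_entry_mean(target_mean, rank):
    """Entry mean mu for which rank * mu**2 equals target_mean (hypothetical helper)."""
    return math.sqrt(target_mean / rank)


v_mean = sum(sum(row) for row in V) / (len(V) * len(V[0]))
v_max = max(max(row) for row in V)

mu = matched_entry_mean(v_mean, RANK)
uniform_case = expected_product_mean(v_max / 2.0, RANK)  # uniform on [0, max(V)]
normal_case = expected_product_mean(v_mean / 5.0, RANK)  # normal, mean = mean(V) / 5

# Empirical check on a larger random draw (size and stddev are arbitrary).
rng = random.Random(0)
size = 50
w = [[rng.gauss(mu, 0.1) for _ in range(RANK)] for _ in range(size)]
h = [[rng.gauss(mu, 0.1) for _ in range(size)] for _ in range(RANK)]
product_mean = sum(
    sum(w[i][k] * h[k][j] for k in range(RANK))
    for i in range(size) for j in range(size)
) / (size * size)

print("mean of V:", v_mean)
print("matched entry mean mu:", mu)
print("expected mean of WH, uniform on [0, max(V)]:", uniform_case)
print("expected mean of WH, normal with mean(V)/5:", normal_case)
print("sampled mean of WH with matched mu:", product_mean)
```

Under the same independence assumption, this also accounts for the overshoot reported in the question: a uniform initializer on [0, max(V)] has entry mean max(V)/2, so the initial product has expected entries of r * max(V)^2 / 4, well above the entries of V in the toy example.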

Even if you do find optimal parameters for this set of data, you might have problems using the same formula for other data sets. If you wish to use the same parameters for a different problem, you lose generalizability, unless you have strong reasons to expect the other data sets to follow a similar distribution.

For the specific algorithm at hand, which uses stochastic gradient descent, there appear to be no simple answers*. The TensorFlow documentation refers to two sources:

The AdaGrad algorithm (includes an evaluation of performance): http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

An introduction to convex optimization: http://cs.stanford.edu/~ppasupat/a9online/uploads/proximal_notes.pdf
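To make the interaction between AdaGrad and the learning rate more tangible, here is a minimal sketch of the commonly quoted diagonal AdaGrad update in plain Python. It is the textbook form with a small `eps` in the denominator, not a copy of TensorFlow's implementation, and the test function (a badly scaled quadratic) and the name `adagrad_minimize` are invented for illustration.

```python
import math


def adagrad_minimize(grad_fn, x0, lr=0.5, steps=100, eps=1e-8):
    """Diagonal AdaGrad: each coordinate's step is divided by the root of its
    own accumulated squared gradients."""
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        for i in range(len(x)):
            accum[i] += g[i] * g[i]
            x[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
    return x


def loss(x):
    """Quadratic whose second coordinate is 100 times steeper than the first."""
    return 0.5 * (x[0] * x[0] + 100.0 * x[1] * x[1])


def grad(x):
    return [x[0], 100.0 * x[1]]


start = [1.0, 1.0]
one_step = adagrad_minimize(grad, start, steps=1)
x_final = adagrad_minimize(grad, start)

print("after one step:", one_step)
print("final point:", x_final, "final loss:", loss(x_final))
```

Although the two gradients differ by a factor of 100, both coordinates move by almost the same amount, because each step is normalized by that coordinate's own gradient history. This is why the nominal learning rate passed to AdaGrad behaves differently from the same number passed to plain gradient descent, and why a rate tuned for one optimizer does not transfer directly to another.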

* "It is clear that choosing the matrix B in the update ... can substantially improve the standard gradient method ... Often, however, such a choice is not obvious, and in stochastic settings ... it is highly non-obvious how to choose the matrix. Moreover, in many stochastic settings, we do not know the true function we are minimizing, since the data arrives in a stream, so pre-computing the distance-generating matrix is impossible." (Duchi & Singer, 2013, p. 5)
