HiveBrain v1.2.0

Why does the LR on Spark run so slowly?

Submitted by: @import:stackexchange-codereview

Problem

Because MLlib does not support sparse input, I ran the following code, which does support the sparse input format, on a Spark cluster. The settings are:

  • 5 nodes, each with 8 cores (all CPUs on each node run at about 100%, roughly 98% in user mode, while the code is running).

  • the input: 10,000,000+ instances, each with 600,000+ dimensions, stored on HDFS



With the above settings, the code takes 20+ minutes for one iteration.

I have asked this question here, where the first version of the code can be found; that version took 3 hours for one iteration. @Rüdiger Klaehn gave me some suggestions, and I updated the code, which brought the time per iteration down from 3 hours to 20 minutes. For my application this is still unacceptable, so can anyone give me some suggestions?

```
import java.util.Random
import scala.collection.mutable.HashMap
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.Vector
import java.lang.Math
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.serializer.KryoRegistrator
import com.esotericsoftware.kryo._
import scala.collection.mutable.ArrayBuffer

object SparseLRV3 {
  val lableNum = 1
  val dimNum = 632918
  val iteration = 10
  val alpha = 0.1
  val lambda = 0.1
  val rand = new Random(42)
  var w = new Array[Double](dimNum)

  class LRRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[SparseVector])
      kryo.register(classOf[DataPoint])
    }
  }

  class SparseVector(val indices: Array[Int], val values: Array[Double]) {
    require(indices.length == values.length)

    def map(f: Double => Double): SparseVector = {
      new SparseVector(indices, values.map(x => f(x)).toArray)
    }

    def +(that: SparseVector): SparseVector = {
      var tups = new HashMap[Int, Double]
      //load tuples from this
      for (i <- 0 until indices.length)
        tups += indices(i) -> values(i)
      //load tuples from that
      for (i <- 0 until that.indices.length) {
        if (tups.contains(that.indices(i)))
          tups(that.indices(i)) += that.values(i)
        else
          tups += that.indices(i) -> that.values(i)
      }
      new SparseVector(tups.keys.toArray, tups.values.toArray)
    }
  }

  // (the remainder of the original program is truncated in this entry)
}
```

Solution

It seems to me that if you have very sparse vectors, converting to a dense vector will be very inefficient. E.g. if you have a sparse vector with 4 non-zero elements and convert it into a dense vector with 1,000,000 elements, the array allocation will take most of the time.

You can of course delay the conversion for as long as possible and work with sparse vectors the whole time, but if Apache Spark does not support sparse vectors you will have to convert at some point, so you might have to look for some other toolkit.
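One way to stay sparse for longer (a sketch, not from the original thread): if each vector's `indices` array is kept sorted, two sparse vectors can be added with a linear merge of the two index arrays, avoiding both the dense conversion warned about above and the per-element `HashMap` used in the question's `SparseVector.+`. The class and method names below are illustrative, not part of the original code:

```scala
import scala.collection.mutable.ArrayBuffer

// Sparse vector with indices assumed sorted in ascending order.
final class SparseVec(val indices: Array[Int], val values: Array[Double]) {
  require(indices.length == values.length)

  // Merge-based addition: O(nnz1 + nnz2) time, no HashMap, no dense array.
  def +(that: SparseVec): SparseVec = {
    val idx = new ArrayBuffer[Int](indices.length + that.indices.length)
    val vs  = new ArrayBuffer[Double](indices.length + that.indices.length)
    var i = 0
    var j = 0
    while (i < indices.length && j < that.indices.length) {
      if (indices(i) < that.indices(j)) {
        idx += indices(i); vs += values(i); i += 1
      } else if (indices(i) > that.indices(j)) {
        idx += that.indices(j); vs += that.values(j); j += 1
      } else { // same index in both vectors: sum the values
        idx += indices(i); vs += values(i) + that.values(j); i += 1; j += 1
      }
    }
    // Drain whichever operand still has entries left.
    while (i < indices.length) { idx += indices(i); vs += values(i); i += 1 }
    while (j < that.indices.length) { idx += that.indices(j); vs += that.values(j); j += 1 }
    new SparseVec(idx.toArray, vs.toArray)
  }
}
```

This replaces the `HashMap` allocation and rehashing per addition with two array scans, and the cost depends only on the number of non-zeros rather than the 600,000+ dimensionality. (Later Spark releases, 1.0 and onward, also ship `org.apache.spark.mllib.linalg.SparseVector`, so on a newer cluster MLlib's built-in sparse support may remove the need for a hand-rolled vector entirely.)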

Context

StackExchange Code Review Q#38354, answer score: 2

Revisions (0)

No revisions yet.