The issue you are facing can be divided into two parts:

  • Converting your ratings (I believe) into LabeledPoint data X.
  • Saving X in libsvm format.

1. Converting your ratings into LabeledPoint data X

Let's consider the following raw ratings:

val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")

You can handle those raw ratings as a coordinate list matrix (COO).

Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple (i: Long, j: Long, value: Double).

Note: a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
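
The parsing inside the map is plain string handling, so it can be sanity-checked without Spark (parseRating is a hypothetical helper that mirrors the map body above):

```scala
// Parse one "user,item,rating" line into a (Long, Long, Double) triple,
// the same logic used inside the map above
def parseRating(line: String): (Long, Long, Double) = {
  val fields = line.split(",")
  (fields(0).toLong, fields(1).toLong, fields(2).toDouble)
}

parseRating("10,3,4.5") // (10, 3, 4.5)
```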

Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:

val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract indexed rows (an RDD[IndexedRow])
  .toDF("label", "features")        // Convert to a DataFrame (needs import spark.implicits._ outside the shell)

2. Saving LabeledPoint data in libsvm format

Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")

Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, DataFrames from Spark versions prior to 2.0 that contain vector or matrix columns need to be migrated to the new spark.ml vector and matrix types.

Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):

import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)

Now let's save the DataFrame:

convertedVecDF.write.format("libsvm").save("data/foo")

And we can check the file contents:

$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
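
As the output shows, each libsvm line is the label followed by space-separated, 1-based index:value pairs. The formatting rule can be sketched as a plain function (toLibSvmLine is a hypothetical helper, not part of Spark):

```scala
// Format one row as a libsvm line: label, then 1-based index:value pairs.
// `indices` are the 0-based positions of the stored entries.
def toLibSvmLine(label: Double, indices: Array[Int], values: Array[Double]): String = {
  val pairs = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
  (label.toString +: pairs).mkString(" ")
}

toLibSvmLine(0.0, Array(0, 2), Array(1.0, 3.0))
// "0.0 1:1.0 3:3.0" -- matches the first line of data/foo above
```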

EDIT: As of Spark 2.1.0 there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format like below:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
Answer from eliasah on Stack Overflow

2 of 3

In order to convert an existing DataFrame to a typed Dataset, I suggest the following. Use this case class:

import org.apache.spark.ml.linalg.{Vector => LVector}

case class LibSvmEntry(
   value: Double,
   features: LVector)

Then you can use the map function to convert it to a LibSVM entry, e.g. df.map(row => LibSvmEntry(row.getDouble(0), row.getAs[LVector](1))) (with import spark.implicits._ in scope for the encoder).

1 of 2
I believe I have a much simpler solution:

/*
/Users/mac/matrix.txt
1 0.5 2.4 3.0
1 99 34 6454
2 0.8 3.0 4.5
*/

// Build one libsvm line: the label, then 1-based index:value pairs.
// (The original version used a(i)(0), which kept only the first character of each value.)
def concat(a: Array[String]): String = {
  val features = (1 until a.length).map(i => s"$i:${a(i)}")
  (a(0) +: features).mkString(" ")
}

val rfile = sc.textFile("file:///Users/mac/matrix.txt")
val f = rfile.map(line => line.split(' ')).map(concat)

2 of 2

I was using Hadoop for the same task, but the logic should be the same. I have created a sample example for your use case. First I create a DataFrame, then remove all rows that have null or blank values. After that I create an RDD and convert each Row into libsvm format. repartition(1) means everything will go into one file. There will be one resultant column, e.g. in case of CTR prediction it will be 1 or 0 only.

Sample file input :

"zip","city","state","latitude","longitude","timezone","dst"
"00210","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00211","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00212","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00213","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00214","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00215","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00501","Holtsville","NY","40.922326","-72.637078","-5","1"
"00544","Holtsville","NY","40.922326","-72.637078","-5","1"

import java.nio.charset.StandardCharsets;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;

import com.google.common.hash.Hashing;

public class LibSvmConvertJob {

    private static final String SPACE = " ";
    private static final String COLON = ":";

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("Libsvm Convertor");

        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        SQLContext sqlContext = new SQLContext(javaSparkContext);

        DataFrame inputDF = sqlContext.read().format("com.databricks.spark.csv").option("header", "true")
                .load("/home/raghunandangupta/inputfiles/zipcode.csv");

        inputDF.printSchema();

        sqlContext.udf().register("convertToNull", (String v1) -> (v1.trim().length() > 0 ? v1.trim() : null), DataTypes.StringType);

        inputDF = inputDF.selectExpr("convertToNull(zip)","convertToNull(city)","convertToNull(state)","convertToNull(latitude)","convertToNull(longitude)","convertToNull(timezone)","convertToNull(dst)").na().drop();

        inputDF.javaRDD().map(new Function<Row, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public String call(Row v1) throws Exception {
                StringBuilder sb = new StringBuilder();
                sb.append(hashCode(v1.getString(0))).append("\t")   //Resultant column
                .append("1"+COLON+hashCode(v1.getString(1))).append(SPACE)
                .append("2"+COLON+hashCode(v1.getString(2))).append(SPACE)
                .append("3"+COLON+hashCode(v1.getString(3))).append(SPACE)
                .append("4"+COLON+hashCode(v1.getString(4))).append(SPACE)
                .append("5"+COLON+hashCode(v1.getString(5))).append(SPACE)
                .append("6"+COLON+hashCode(v1.getString(6)));
                return sb.toString();
            }
            private String hashCode(String value) {
                return Math.abs(Hashing.murmur3_32().hashString(value, StandardCharsets.UTF_8).hashCode()) + "";
            }
        }).repartition(1).saveAsTextFile("/home/raghunandangupta/inputfiles/zipcode");

    }
}
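
The line-building logic above is easy to exercise without Spark. Here is a minimal Scala sketch of the same idea (toHashedLine is a hypothetical helper; plain String.hashCode stands in for Guava's murmur3, so the numeric values differ from the job above):

```scala
// Build one libsvm-style line: hashed label, a tab, then 1-based index:hashedValue pairs
def toHashedLine(label: String, features: Seq[String]): String = {
  def h(s: String): Int = math.abs(s.hashCode)
  val pairs = features.zipWithIndex.map { case (v, i) => s"${i + 1}:${h(v)}" }
  s"${h(label)}\t${pairs.mkString(" ")}"
}
```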