Brave Search

stackoverflow.com › questions › 41416291 › how-to-prepare-data-into-a-libsvm-format-from-dataframe

apache spark - How to prepare data into a LibSVM format from DataFrame? - Stack Overflow

1 of 3

The issue you are facing can be divided into the following :

Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.

1. Converting your ratings into LabeledPoint data X

Let's consider the following raw ratings :

val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")

You can handle those raw ratings as a coordinate list matrix (COO).

Spark implements a distributed matrix backed by an RDD of its entries : CoordinateMatrix where each entry is a tuple of (i: Long, j: Long, value: Double).

Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    // Each line is "rowIndex,columnIndex,value"
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }

Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows :

val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
                .toIndexedRowMatrix().rows // Extract indexed rows
                .toDF("label", "features") // Convert rows
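
As a cluster-free illustration of the coordinate-list idea, the plain-Python sketch below groups the same raw ratings by row index, which is roughly what extracting indexed rows amounts to conceptually. The helper name coo_to_rows is hypothetical and not part of Spark.

```python
from collections import defaultdict

raw_ratings = ["0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0",
               "1,3,3.0", "3,3,4.0", "10,3,4.5"]

def coo_to_rows(lines):
    """Group "i,j,value" coordinate entries into {row: {column: value}}."""
    rows = defaultdict(dict)
    for line in lines:
        i, j, value = line.split(",")
        rows[int(i)][int(j)] = float(value)
    return dict(rows)

rows = coo_to_rows(raw_ratings)
for row_index in sorted(rows):
    print(row_index, rows[row_index])
```

Only rows that actually have entries appear in the result, which is the point of a sparse representation.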

2. Saving LabeledPoint data in libsvm format

Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")

import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)

Now let's save the DataFrame :

convertedVecDF.write.format("libsvm").save("data/foo")

And we can check the file contents:

$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
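
These two output lines can be reproduced by hand. The sketch below is plain Python, not Spark's writer. to_libsvm_line is a hypothetical helper that assumes features arrive as (zero-based index, value) pairs and shifts them to the one-based indices the format expects.

```python
def to_libsvm_line(label, pairs):
    """Format a label and (zero_based_index, value) pairs as one LibSVM line."""
    parts = [str(float(label))]
    for index, value in sorted(pairs):
        parts.append(f"{index + 1}:{float(value)}")
    return " ".join(parts)

# Sparse vector of size 3 with non-zeros at positions 0 and 2.
neg_line = to_libsvm_line(0.0, zip([0, 2], [1.0, 3.0]))
# Dense vector: every position is listed, including the explicit zero.
pos_line = to_libsvm_line(1.0, enumerate([1.0, 0.0, 3.0]))

print(neg_line)
print(pos_line)
```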

EDIT: In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as shown below:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")

2 of 3

In order to convert an existing DataFrame to a typed Dataset, I suggest the following. Use the following case class:

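// Assumption: L is an alias for the ml linalg package, e.g.
// import org.apache.spark.ml.{linalg => L}
case class LibSvmEntry (
   value: Double,
   features: L.Vector)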

Then you can use the map function to convert each row to a LibSvmEntry, like so: df.map[LibSvmEntry], applied with a function that builds a LibSvmEntry from each Row.

stackoverflow.com › questions › 43920111 › convert-dataframe-to-libsvm-format

apache spark - convert dataframe to libsvm format - Stack Overflow

1 of 3

I would do it like this (it's just an example with an arbitrary dataframe; I don't know how your df1 is built, so the focus is on the data transformations):

This is my way to convert dataframe to libsvm format:

# ... your previous imports

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  3|  6|  
|  4|  5| 20|
|  7|  8|  8|
+---+---+---+

# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe in a RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]

# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0], list(line[1:]))) # arbitrary mapping: first column is the label, the rest are features
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]

# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")

What you will see in the "/your/Path/nameFolder/part-0000*" files is:

1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0

See here for LabeledPoint docs

2 of 3

I had to do this for it to work

D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))

github.com › apache › spark › blob › master › mllib › src › main › scala › org › apache › spark › ml › source › libsvm › LibSVMDataSource.scala

spark/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMDataSource.scala at master · apache/spark

Dataset<Row> df = spark.read().format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt");

LIBSVM data source supports the following options:
- "numFeatures": number of features. If unspecified or nonpositive, the number of features will be determined automatically at the ...
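
When numFeatures is not given, the feature count has to come from the data itself. Below is a minimal plain-Python sketch of that idea, assuming the count is simply the largest index seen in any line. infer_num_features is a hypothetical helper, not Spark's implementation.

```python
def infer_num_features(lines):
    """Return the largest one-based feature index found in LibSVM lines."""
    largest = 0
    for line in lines:
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the label
            index = int(token.split(":", 1)[0])
            largest = max(largest, index)
    return largest

sample = ["0.0 1:1.0 3:3.0", "1.0 2:5.0 7:0.5"]
num_features = infer_num_features(sample)
print(num_features)
```

Inferring the count this way needs a full pass over the data, which is one reason to supply it up front when it is already known.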

Author apache

Kaggle

kaggle.com › general › 16986

Input Files to Spark Mllib | Kaggle

Databricks

api-docs.databricks.com › scala › spark › latest › org › apache › spark › ml › source › libsvm › LibSVMDataSource.html

Databricks Scala Spark API - org.apache.spark.ml.source.libsvm.LibSVMDataSource

// Scala
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

// Java
Dataset<Row> df = spark.read().format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt");

Apache Spark

spark.apache.org › docs › latest › api › scala › org › apache › spark › ml › source › libsvm › LibSVMDataSource.html

Spark 4.1.0 ScalaDoc - org.apache.spark.ml.source.libsvm.LibSVMDataSource

Berkeley EECS

people.eecs.berkeley.edu › ~jegonzal › pyspark › _modules › pyspark › mllib › util.html

pyspark.mllib.util — PySpark master documentation

The LIBSVM format is a text-based format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. This method parses each line into a LabeledPoint, ...
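
Here is a rough plain-Python sketch of that parsing step, not the PySpark implementation. parse_libsvm_line is a hypothetical helper that returns the label plus zero-based indices and values, and rejects indices that are not one-based and strictly ascending.

```python
def parse_libsvm_line(line):
    """Parse "label i1:v1 i2:v2 ..." into (label, zero_based_indices, values)."""
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    previous = 0
    for token in tokens[1:]:
        raw_index, raw_value = token.split(":", 1)
        index = int(raw_index)
        if index <= previous:
            raise ValueError(f"indices must be one-based and ascending: {line!r}")
        previous = index
        indices.append(index - 1)  # shift to zero-based for in-memory use
        values.append(float(raw_value))
    return label, indices, values

label, indices, values = parse_libsvm_line("1.0 1:3.0 2:6.0")
print(label, indices, values)
```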

github.com › apache › spark › blob › master › mllib › src › test › scala › org › apache › spark › ml › source › libsvm › LibSVMRelationSuite.scala

spark/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala at master · apache/spark

val df = spark.read.format("libsvm").options(Map("vectorType" -> "dense"))
  .load(path)
assert(df.columns(0) == "label")
assert(df.columns(1) == "features")
assert(df.count() == 3)
val row1 = df.first()
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[DenseVector](1)
assert(v == Vectors.dense(1.0, 0.0, 2.0, 0.0, 3.0, 0.0))
assert(AttributeGroup.fromStructField(df.schema("features")).size === v.size)

Author apache

github.com › apache › spark › blob › master › examples › src › main › scala › org › apache › spark › examples › ml › DataFrameExample.scala

spark/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala at master · apache/spark

println(s"Loading LIBSVM file with UDT from ${params.input}.")
val df: DataFrame = spark.read.format("libsvm").load(params.input).cache()
println("Schema from LIBSVM:")
df.printSchema()
println(s"Loaded training data as a DataFrame with ${df.count()} records.")

Author apache

github.com › ThoroughImages › EasySparse › blob › master › spark_to_libsvm.scala

EasySparse/spark_to_libsvm.scala at master · ThoroughImages/EasySparse

 * LibSVM format
 */
import org.apache.spark.rdd.RDD

/**
 * Converting String to Double,
 * otherwise return 0.0.
 */
def parseDouble(s: String) = try { s.toDouble } catch { case _ => 0.0 }

/**
 * Load RDD from HDFS and split each row

Author ThoroughImages

stackoverflow.com › questions › 44965186 › how-to-understand-the-format-type-of-libsvm-of-spark-mllib

How to understand the format type of libsvm of Spark MLlib? - Stack Overflow

spark.apache.org › docs › latest › api › java › index.html

1 of 1

The LibSVM format is quite simple. The first value on each row is the class label, in this case 0 or 1. The features follow, each written as a pair of values: the first is the feature index (i.e. which feature it is) and the second is the actual value.

The feature indices start from 1 (there is no index 0) and are in ascending order. Features whose indices are not present on a row have the value 0.

In summary, each row looks like this;

<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>

This format is advantageous when the data is sparse and contains lots of zeros. Zero values are not saved, which makes the files both smaller and easier to read.
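
To make the rule that missing indices mean zero concrete, here is a small plain-Python sketch that expands one line into a dense list. The function name to_dense and the num_features argument are illustrative assumptions.

```python
def to_dense(line, num_features):
    """Expand one LibSVM line into (label, dense_feature_list)."""
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * num_features  # every feature defaults to zero
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        dense[int(index) - 1] = float(value)  # one-based -> zero-based
    return label, dense

label, features = to_dense("1 1:1.0 3:2.0 5:3.0", 6)
print(label, features)
```

Three stored pairs become six features, so the text form stays short even when the vector is wide.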

Apache Spark

Spark 3.4.0 JavaDoc

github.com › ajatix › spark-libsvm

GitHub - ajatix/spark-libsvm: LibSVM data source for Spark SQL and DataFrames · GitHub

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")

val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("newcars.csv.gz")

Author ajatix

reddit.com › r/apachespark › help to transform dataset into libsvm format for multilayer perceptron classification.

r/apachespark on Reddit: Help to Transform dataset into LibSVM format for multilayer perceptron classification.

February 6, 2019 -

Hello. I'm using Spark [2.4] for a college project. I was able to implement the decision tree and random forest after successfully converting my dataset into LibSVM format using pyspark.

Now I need to use the multilayer perceptron classification. I viewed the examples on the official website/GitHub and the datasets in the data folder. I've noticed that both datasets, "sample_libsvm_data.txt" and "sample_multiclass_classification_data.txt", are in libsvm format but have a different structure.

How do I convert my dataset into the proper format for the multilayer perceptron?

I'm using pyspark but I can also use Java and Scala to convert the dataset. Thank you.

Last time I checked, that wasn't required. Are you using SparkML?

1 of 1

gist.github.com › xrazor1031 › 6d097e85df7aab956a5be549a4c36408

[dataframe to libsvm] #libsvm · GitHub

dataframe2libsvm.scala

Apache Spark

spark.apache.org › docs › 3.4.1 › api › java › org › apache › spark › ml › source › libsvm › LibSVMDataSource.html

LibSVMDataSource (Spark 3.4.1 JavaDoc)

Rdrr.io

rdrr.io › cran › sparklyr › man › spark_read_libsvm.html

spark_read_libsvm: Read libsvm file into a Spark DataFrame. in sparklyr: R Interface to Apache Spark

November 5, 2025 - spark_read_libsvm( sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), ...

stackoverflow.com › questions › 40037395 › spark-converting-csv-to-libsvm-format

spark converting CSV to libsvm format - Stack Overflow