You have to encode your data before calling fit(). As already mentioned, fit() does not accept strings, but this can be solved.
There are several classes that can be used:
LabelEncoder: turns each distinct string into an incremental integer value
OneHotEncoder: uses the One-of-K scheme to turn each distinct string into its own binary (0/1) column
Personally, I posted almost the same question on Stack Overflow some time ago. I wanted a scalable solution but didn't get any answer. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many distinct strings the matrix grows very quickly and requires a lot of memory.
Answer from RPresle on Stack Overflow
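To make the difference between the two encoders concrete, here is a minimal sketch. The `colors` array is a made-up toy column, not data from the question, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column of string categories (assumption, not from the question).
colors = np.array(["red", "green", "blue", "green"])

# LabelEncoder: one integer per distinct string (classes are sorted first).
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors)

# OneHotEncoder: one binary column per distinct string; it expects 2-D input
# and returns a sparse matrix by default, which helps with memory.
onehot_encoder = OneHotEncoder()
color_onehot = onehot_encoder.fit_transform(colors.reshape(-1, 1))

print(color_codes)
print(color_onehot.toarray())
```

With many distinct strings the one-hot matrix gets one column per string, so keeping it sparse (rather than calling `toarray()`) is probably the first thing to try when memory becomes a concern.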
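LabelEncoding worked for me (basically you have to encode your data feature-wise). Here myData is a 2D array of strings:
import numpy as np
from sklearn import preprocessing

# Read every column as a string (up to 20 characters), skipping the header row.
myData = np.genfromtxt(filecsv, delimiter=",", dtype="U20", skip_header=1)

# Encode one feature (column) at a time. The codes go into a separate integer
# array, because writing them back into myData would turn them into strings again.
encoded = np.empty(myData.shape, dtype=int)
le = preprocessing.LabelEncoder()
for i in range(myData.shape[1]):
    encoded[:, i] = le.fit_transform(myData[:, i])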
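As an alternative to the loop above, scikit-learn's OrdinalEncoder encodes all columns of a 2D array in one call and remembers the mapping for each column. This is a swapped-in technique, not what the answer used; `my_data` below is a hypothetical stand-in for the CSV contents.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical 2-D array of strings standing in for myData (assumption).
my_data = np.array([
    ["red", "small"],
    ["blue", "large"],
    ["red", "large"],
])

# Each column gets its own mapping, stored in encoder.categories_.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(my_data)  # float array, same shape as my_data

# The stored mappings allow decoding back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

Keeping one fitted encoder around also means the same mapping can be applied to test data with `encoder.transform(...)`, which the single reused LabelEncoder in the loop cannot do.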
Obviously some of your lines don't contain valid float data; specifically, some lines have a text id that can't be converted to float.
When you try it in the interactive prompt you are only testing the first line, so the best approach is to print the line on which you get this error, and you will know which line is wrong, e.g.
#!/usr/bin/python
from scipy import stats

with open('data2.txt', 'r') as fh:
    f = fh.readlines()
N = len(f) - 1
for i in range(0, N):
    w = f[i].split()
    l1 = w[1:8]
    l2 = w[8:15]
    try:
        list1 = [float(x) for x in l1]
        list2 = [float(x) for x in l2]
    except ValueError as e:
        # Report the offending line, then skip it instead of reusing stale lists.
        print("error", e, "on line", i)
        continue
    result = stats.ttest_ind(list1, list2)
    print(result[1])
My error was very simple: the text file containing the data had a whitespace (and therefore invisible) character on the last line.
As an output of grep, I had "45 " (with a trailing whitespace character) instead of just "45".
I am trying to use the Naive Bayes model from Sklearn, a machine learning library, on some data. Every time I run it, though, it says it can't convert a string to a float. Here's my code:
# Import necessary libraries.
from sklearn.naive_bayes import MultinomialNB # Import the Naive Bayes model.
from sklearn.model_selection import train_test_split # Import the train_test_split function.
from sklearn.metrics import accuracy_score # For testing the accuracy of the model
import pandas as pd # Import pandas
print('NAIVE_BAYES.PY IS RUNNING')
df = pd.read_csv('C:\\Users\\gjohn\\Documents\\code\\machineLearning\\trading_bot\\train_test.csv') # Reads in the filtered posts.
classes = df['class'] # Gets the labels from the dataframe.
df.drop('class', axis=1) # Drops the class column from the dataframe.
# Split the data into training and testing data.
train_x, test_x, train_y, test_y = train_test_split(df, classes, test_size=0.2)
# Create the Naive Bayes model.
model = MultinomialNB()
# Train the model.
model.fit(train_x, train_y)
# Test the model.
y_predict = model.predict(test_x)
# Calculate the accuracy of the model.
accuracy = accuracy_score(test_y, y_predict)
print(f'Accuracy: {accuracy}')
And here is the error:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\naive_bayes.py", line 22, in <module>
model.fit(train_x, train_y)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\naive_bayes.py", line 663, in fit
X, y = self._check_X_y(X, y)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\naive_bayes.py", line 523, in _check_X_y
return self._validate_data(X, y, accept_sparse="csr", reset=reset)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 572, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py", line 956, in check_X_y
X = check_array(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Tranzylvania'
And every time I run it, it is a different word. For example, this time it was "Tranzylvania", but the next time I run it, it will be something else. Can anyone please help me? I'm stumped.
I assume you are using text data as your input matrix X. The first point is that you have to include your preprocessing step, just as you would without a calibrated classifier, so, as you already know, you can use a Pipeline like so:
linear_svc = LinearSVC()
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',
                                        cv=3)
model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', calibrated_svc)]).fit(X, y)
Another option, if you are interested in getting probabilities from your SVM, is to set the parameter probability=True. LinearSVC does not have this parameter, but the SVC class with a linear kernel is similar to LinearSVC:
model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SVC(probability=True, kernel='linear'))]).fit(X, y)
This fits a logistic regression on top of the SVM's decision scores (Platt scaling), using internal cross-validation.
Both options work if you only need probabilities as such, but if you also care about how well calibrated those probabilities are, the first option is better.
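For completeness, the first option can be sketched as a self-contained script. The corpus `X` and labels `y` below are invented toy data, chosen only so that 3-fold calibration has enough samples per class; they are assumptions, not the asker's data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (assumption): 1 = positive, 0 = negative.
X = ["good movie", "great film", "loved it", "really good",
     "great acting", "nice plot",
     "bad movie", "awful film", "hated it", "really bad",
     "terrible acting", "boring plot"]
y = [1] * 6 + [0] * 6

# LinearSVC has no predict_proba; the calibration wrapper adds one.
linear_svc = LinearSVC()
calibrated_svc = CalibratedClassifierCV(linear_svc, method="sigmoid", cv=3)

# The vectorizer lives inside the pipeline, so raw strings can be passed
# to both fit() and predict_proba().
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", calibrated_svc)]).fit(X, y)

proba = model.predict_proba(["good film", "boring movie"])
print(proba)  # one row per input text, one column per class
```

Because the TF-IDF step is part of the pipeline, it is applied to every input before the classifier sees it, so the "could not convert string to float" error should not occur when raw text is passed in.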
For any kind of machine learning task, including an NLP task (which is what you are doing), you need to convert string/text values to numeric values. The model cannot work with string values; it only accepts numeric input.
So, for example, in an ordinary machine learning task with categorical columns, you would use classes like OneHotEncoder, LabelEncoder etc. to convert string values to numeric ones.
In your case, you are working on an NLP task, where the input is free text rather than categorical strings. So you need to convert the text into numeric values first and then fit your preferred algorithm. There are many ways to encode text numerically, such as Bag of Words, TF-IDF, word2vec etc. You can read about them by searching on Google.
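As a rough illustration of the Bag of Words approach, the sketch below puts a CountVectorizer in front of MultinomialNB. The `posts` and `labels` are hypothetical toy data, not the contents of the asker's CSV.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy posts and labels (assumption, not the real data).
posts = [
    "buy this stock now",
    "stock price will rise",
    "sell everything today",
    "price will fall, sell now",
]
labels = ["buy", "buy", "sell", "sell"]

# CountVectorizer turns each text into a vector of word counts (Bag of Words),
# which is the kind of numeric input MultinomialNB expects.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(posts, labels)

prediction = model.predict(["sell the stock today"])
print(prediction)
```

Swapping CountVectorizer for TfidfVectorizer gives the TF-IDF variant with no other changes to the pipeline.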
I get the following error when I run my script: "ValueError: could not convert string to float: Normal"
Any help would be greatly appreciated.
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
train,test=train_test_split(train_csv, test_size=0.3, random_state=0)
train_X=train[train.columns[1:]]
train_Y=train[train.columns[:1]]
test_X=test[test.columns[1:]]
test_Y=test[test.columns[:1]]
X=train_csv[train_csv.columns[1:]]
Y=train_csv['SalePrice']
Radial Support Vector Machines (rbf-SVM)
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X, train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is ', metrics.accuracy_score(prediction1,test_Y))