What is the use of the random seed in Spark's Dataset.randomSplit?
Answers
Answer:
randomSplit breaks the RDD into an array of two RDDs; the first one can be accessed at index 0, since it is the first element of that array. It's a pretty basic operation.
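For illustration, here is a minimal Scala sketch (the local SparkSession and the toy spark.range data are assumptions made up for this example, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitExample {
  def main(args: Array[String]): Unit = {
    // Local SparkSession purely for demonstration (assumption).
    val spark = SparkSession.builder()
      .appName("randomSplitExample")
      .master("local[*]")
      .getOrCreate()

    // A toy Dataset of 1000 rows.
    val data = spark.range(1000)

    // randomSplit takes the weights and a seed, and returns an Array of Datasets;
    // index 0 is the first split. Passing a seed makes the split reproducible.
    val splits = data.randomSplit(Array(0.8, 0.2), 42L)
    val first  = splits(0) // roughly 80% of the rows
    val second = splits(1) // roughly 20% of the rows

    println(s"first: ${first.count()}, second: ${second.count()}")
    spark.stop()
  }
}
```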
Explanation:
This has nothing to do with positive and negative examples. Those should already exist (both kinds) within the data set.
You're splitting the data randomly to generate two sets: one to use during training of the ML algorithm (the training set), and the second to check whether the training is working (the test set). This is widely done and a very good idea, because it catches overfitting, which otherwise can make it seem like you have a great ML solution when it has actually just memorized the answer for each data point and can't interpolate or generalize.
In fact, if you have a reasonable amount of data, I would recommend splitting it into three sets: "training", which you run the ML algorithms on; "test", which you use to check how your training is going; and "validation", which you never touch until you think your entire ML process is optimized. (The optimization may require using the test set several times, e.g. to check convergence, which makes it a set the model has been partly fitted to, so it's often hard to be sure you've really avoided overfitting. Holding out the validation set until the very end is the best way to check; or, if you can gather new data, you can do that instead.)
Note that the split is random to avoid problems where the different data sets contain statistically different data; e.g. the early data might be different than the late data, so taking the first half and second half of the data set might cause problems.
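Here is a sketch of the three-way split described above, assuming a Dataset called data already exists; the 60/20/20 weights and the seed value are just illustrative:

```scala
// Assumes `data` is an existing Dataset; weights and seed are illustrative.
val Array(training, test, validation) =
  data.randomSplit(Array(0.6, 0.2, 0.2), 12345L)

// Fixing the seed means the same rows land in the same split on every run,
// which keeps experiments reproducible while the assignment itself stays random.
```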
Hope it helps you.
Thanks...