Issue
Suppose I give three file paths to a Spark context to read, and each file has a schema (header) line in its first row.
val rdd = sc.textFile("file1,file2,file3")
Now, how can we skip header lines from this rdd?
Solution
If there were just one header line, in the first record of the first partition, then the most efficient way to filter it out would be:
rdd.mapPartitionsWithIndex {
  // Drop the first line only in partition 0, where the single header line lives
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}
This doesn't help, of course, if there are many files, each with its own header line. In that case you can read each file separately, strip its header the same way, and union the resulting RDDs, as sketched below.
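A minimal sketch of that per-file approach, using the three file names from the question; the variable names are only for illustration, and the union is done with SparkContext.union:
// Read each file on its own, drop the first line of its first partition, then union.
val files = Seq("file1", "file2", "file3")
val cleaned = files.map { path =>
  sc.textFile(path).mapPartitionsWithIndex {
    (idx, iter) => if (idx == 0) iter.drop(1) else iter
  }
}
val rddWithoutHeaders = sc.union(cleaned)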
You could also just write a filter that matches only lines that could be a header and drops them. This is quite simple, but less efficient, since it inspects every line.
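For example, a sketch assuming the header line starts with a known column name; the "col1" prefix below is a placeholder, not something given in the question:
// Keep every line that does not look like the header row.
val noHeader = rdd.filter(line => !line.startsWith("col1"))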
Python equivalent:
from itertools import islice
rdd.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
Answered By - Sean Owen