Skip to content
Snippets Groups Projects
  • Patrick Wendell's avatar
    08c1a42d
    Add a `repartition` operator. · 08c1a42d
    Patrick Wendell authored
    This patch adds an operator called repartition with more straightforward
    semantics than the current `coalesce` operator. There are a few use cases
    where this operator is useful:
    
    1. If a user wants to increase the number of partitions in the RDD. This
    is more common now with streaming. E.g. a user is ingesting data on one
    node but they want to add more partitions to ensure parallelism of
    subsequent operations across threads or the cluster.
    
    Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
    super confusing.
    
    2. If a user has input data where the number of partitions is not known. E.g.
    
    > sc.textFile("some file").coalesce(50)....
    
    This is both vague semantically (am I growing or shrinking this RDD) but also,
    may not work correctly if the base RDD has fewer than 50 partitions.
    
    The new operator forces shuffles every time, so it will always produce exactly
    the number of new partitions. It also throws an exception rather than silently
    not-working if a bad input is passed.
    
    I am currently adding streaming tests (requires refactoring some of the test
    suite to allow testing at partition granularity), so this is not ready for
    merge yet. But feedback is welcome.
    08c1a42d
    History
    Add a `repartition` operator.
    Patrick Wendell authored
    This patch adds an operator called repartition with more straightforward
    semantics than the current `coalesce` operator. There are a few use cases
    where this operator is useful:
    
    1. If a user wants to increase the number of partitions in the RDD. This
    is more common now with streaming. E.g. a user is ingesting data on one
    node but they want to add more partitions to ensure parallelism of
    subsequent operations across threads or the cluster.
    
    Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
    super confusing.
    
    2. If a user has input data where the number of partitions is not known. E.g.
    
    > sc.textFile("some file").coalesce(50)....
    
    This is both vague semantically (am I growing or shrinking this RDD) but also,
    may not work correctly if the base RDD has fewer than 50 partitions.
    
    The new operator forces shuffles every time, so it will always produce exactly
    the number of new partitions. It also throws an exception rather than silently
    not-working if a bad input is passed.
    
    I am currently adding streaming tests (requires refactoring some of the test
    suite to allow testing at partition granularity), so this is not ready for
    merge yet. But feedback is welcome.