Skip to content
Snippets Groups Projects
  1. Nov 01, 2013
  2. Oct 24, 2013
    • Patrick Wendell's avatar
      Add a `repartition` operator. · 08c1a42d
      Patrick Wendell authored
      This patch adds an operator called repartition with more straightforward
      semantics than the current `coalesce` operator. There are a few use cases
      where this operator is useful:
      
      1. If a user wants to increase the number of partitions in the RDD. This
      is more common now with streaming. E.g. a user is ingesting data on one
      node but they want to add more partitions to ensure parallelism of
      subsequent operations across threads or the cluster.
      
      Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
      super confusing.
      
      2. If a user has input data where the number of partitions is not known. E.g.
      
      > sc.textFile("some file").coalesce(50)....
      
      This is both vague semantically (am I growing or shrinking this RDD) but also,
      may not work correctly if the base RDD has fewer than 50 partitions.
      
      The new operator forces shuffles every time, so it will always produce exactly
      the number of new partitions. It also throws an exception rather than silently
      not-working if a bad input is passed.
      
      I am currently adding streaming tests (requires refactoring some of the test
      suite to allow testing at partition granularity), so this is not ready for
      merge yet. But feedback is welcome.
      08c1a42d
  3. Oct 23, 2013
  4. Oct 22, 2013
  5. Oct 19, 2013
  6. Oct 17, 2013
  7. Oct 10, 2013
  8. Oct 09, 2013
  9. Oct 08, 2013
  10. Oct 06, 2013
  11. Oct 04, 2013
  12. Oct 03, 2013
  13. Oct 02, 2013
  14. Sep 24, 2013
  15. Sep 23, 2013
  16. Sep 15, 2013
  17. Sep 11, 2013
  18. Sep 10, 2013
  19. Sep 09, 2013
  20. Sep 08, 2013
  21. Sep 07, 2013
  22. Sep 06, 2013
Loading