  • SPARK-1939 Refactor takeSample method in RDD to use ScaSRS · 1de1d703
    Doris Xin authored
    Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate greater than sample_size/total, so that the sample is large enough with a success rate >= 0.9999. Added a unit test for the private method to validate the choice of sampling rate.
    
    Author: Doris Xin <doris.s.xin@gmail.com>
    Author: dorx <doris.s.xin@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #916 from dorx/takeSample and squashes the following commits:
    
    5b061ae [Doris Xin] merge master
    444e750 [Doris Xin] edge cases
    3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
    82dde31 [Xiangrui Meng] update pyspark's takeSample
    48d954d [Doris Xin] remove unused imports from RDDSuite
    fb1452f [Doris Xin] allowing num to be greater than count in all cases
    1481b01 [Doris Xin] washing test tubes and making coffee
    dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
    64e445b [Doris Xin] logwarnning as soon as it enters the while loop
    55518ed [Doris Xin] added TODO for logging in rdd.py
    eff89e2 [Doris Xin] addressed reviewer comments.
    ecab508 [Doris Xin] "fixed checkstyle violation
    0a9b3e3 [Doris Xin] "reviewer comment addressed"
    f80f270 [Doris Xin] Merge branch 'master' into takeSample
    ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
    065ebcd [Doris Xin] Merge branch 'master' into takeSample
    9bdd36e [Doris Xin] Check sample size and move computeFraction
    e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
    7cab53a [Doris Xin] fixed import bug in rdd.py
    ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
    1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
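
The sampling-rate computation described in the commit message can be sketched as follows. This is a minimal illustration in Python, not the code from this commit: the function name `compute_fraction_for_sample_size`, the failure probability `delta = 5e-5`, and the standard-deviation multipliers are assumptions chosen to show how a rate slightly above sample_size/total can be derived so that one sampling pass rarely comes up short.

```python
from math import log, sqrt


def compute_fraction_for_sample_size(sample_size, total, with_replacement):
    """Return a sampling rate slightly above sample_size / total.

    Hypothetical helper for illustration only. The constants below are
    assumptions, not values confirmed by the commit message.
    """
    if total <= 0:
        raise ValueError("total must be positive")

    fraction = float(sample_size) / total

    if with_replacement:
        # Poisson sampling: the sample count has mean fraction * total and
        # standard deviation sqrt(fraction * total). Pad the rate by a few
        # standard deviations; use a wider margin for very small samples.
        num_std_dev = 9 if sample_size < 12 else 5
        return fraction + num_std_dev * sqrt(fraction / total)

    # Bernoulli sampling: pad the rate with a tail-bound style term so the
    # chance of drawing fewer than sample_size items is roughly delta.
    delta = 5e-5  # assumed failure probability
    gamma = -log(delta) / total
    return min(1.0, fraction + gamma + sqrt(gamma * gamma + 2 * gamma * fraction))


if __name__ == "__main__":
    # Ask for 100 items out of 1000, with and without replacement.
    print(compute_fraction_for_sample_size(100, 1000, False))
    print(compute_fraction_for_sample_size(100, 1000, True))
```

A caller would sample once at the returned rate and, in the rare case that fewer than sample_size items come back, resample, which matches the while loop mentioned in the squashed commits above.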