    [SPARK-2871] [PySpark] add approx API for RDD · 8df4dad4
    Davies Liu authored
    RDD.countApprox(self, timeout, confidence=0.95)
    
            :: Experimental ::
            Approximate version of count() that returns a potentially incomplete
            result within a timeout, even if not all tasks have finished.
    
            >>> rdd = sc.parallelize(range(1000), 10)
            >>> rdd.countApprox(1000, 1.0)
            1000
    
    RDD.sumApprox(self, timeout, confidence=0.95)
    
            Approximate operation that returns the sum within a timeout,
            or once the confidence is met.
    
            >>> rdd = sc.parallelize(range(1000), 10)
            >>> r = sum(xrange(1000))
            >>> (rdd.sumApprox(1000) - r) / r < 0.05
            True
    
    RDD.meanApprox(self, timeout, confidence=0.95)
    
            :: Experimental ::
            Approximate operation that returns the mean within a timeout,
            or once the confidence is met.
    
            >>> rdd = sc.parallelize(range(1000), 10)
            >>> r = sum(xrange(1000)) / 1000.0
            >>> (rdd.meanApprox(1000) - r) / r < 0.05
            True
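The doctests above compare a signed relative error against 0.05, so they would also pass if the estimate came out far too low. A minimal sketch of a two-sided check follows. The helper name `within_tolerance` is hypothetical and not part of PySpark, and plain numbers stand in for the values that `sumApprox` or `meanApprox` would return.

```python
def within_tolerance(estimate, exact, tolerance=0.05):
    """Return True if `estimate` is within `tolerance` relative error of `exact`.

    Hypothetical helper for illustration; not part of the PySpark API.
    """
    if exact == 0:
        # Relative error is undefined at zero; fall back to absolute error.
        return abs(estimate) <= tolerance
    return abs(estimate - exact) / abs(exact) < tolerance


exact_sum = sum(range(1000))
low_estimate = 0.5 * exact_sum  # an estimate that is 50% too low

# The signed check from the doctests accepts the low estimate...
signed_check = (low_estimate - exact_sum) / exact_sum < 0.05
# ...while the two-sided check rejects it.
two_sided_check = within_tolerance(low_estimate, exact_sum)
print(signed_check, two_sided_check)
```

Taking the absolute value keeps the assertion meaningful for both under- and over-estimates.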
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #2095 from davies/approx and squashes the following commits:
    
    e8c252b [Davies Liu] add approx API for RDD