    Merge pull request #73 from falaki/ApproximateDistinctCount · 8b8e70eb
    Reynold Xin authored
    Approximate distinct count
    
    Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
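
For readers unfamiliar with HyperLogLog, the following is a minimal, self-contained Python sketch of the general technique and of how a single parameter trades memory for accuracy. It is not the stream-lib implementation and not the Spark API: the class `HyperLogLog`, the parameter `p`, and the helpers `count_approx_distinct` and `count_approx_distinct_by_key` are hypothetical names chosen for illustration only. The sketch keeps `2**p` small registers, so memory grows with `p` while the typical relative error shrinks roughly as `1.04 / sqrt(2**p)`.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (not stream-lib's code)."""

    def __init__(self, p=12):
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Deterministic 64-bit hash derived from SHA-1 of the item's repr.
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        width = 64 - self.p
        idx = x >> width                 # top p bits select a register
        rest = x & ((1 << width) - 1)    # remaining bits
        # Rank: 1-based position of the leftmost 1-bit within the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built with the same p combine by taking register-wise maxima.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different p")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        if m == 16:
            alpha = 0.673
        elif m == 32:
            alpha = 0.697
        elif m == 64:
            alpha = 0.709
        else:
            alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


def count_approx_distinct(items, p=12):
    """Hypothetical plain-Python analogue of an approximate distinct count."""
    sketch = HyperLogLog(p)
    for item in items:
        sketch.add(item)
    return sketch.count()


def count_approx_distinct_by_key(pairs, p=12):
    """Hypothetical per-key variant: one sketch per key, fed with that key's values."""
    sketches = {}
    for key, value in pairs:
        sketches.setdefault(key, HyperLogLog(p)).add(value)
    return {key: sketch.count() for key, sketch in sketches.items()}
```

A call such as `count_approx_distinct((i % 50000 for i in range(100000)), p=12)` should return a value close to 50000, and lowering `p` reduces memory at the cost of a noisier estimate. The `merge` method illustrates why this kind of sketch suits distributed counting: per-partition sketches can be combined without revisiting the data.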
pom.xml 25.56 KiB