Merge pull request #146 from JoshRosen/pyspark-custom-serializers
Custom Serializers for PySpark

This pull request adds support for custom serializers to PySpark. For now, all Python-transformed (or parallelize()d) RDDs are serialized with the same serializer, which is specified when creating the SparkContext (see the usage sketch below). PySpark currently includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers. It is fairly easy to add support for other serializers, although instructions for doing so still need to be added.

A few notable changes:

- The Scala `PythonRDD` class no longer manipulates pickled objects; data from `textFile` is written to Python as MUTF-8 strings. The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD. This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe (see the sketch after the usage example below).
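As a rough illustration of the per-context serializer selection described above, here is a minimal usage sketch. The `serializer` keyword argument and the exact constructor signatures are assumptions made for illustration; only the `MarshalSerDe` class name is taken from this pull request, and the name may differ in the merged code.

```python
from pyspark import SparkContext
from pyspark.serializers import MarshalSerDe  # class name as described in this PR

# Assumption: the serializer chosen at SparkContext creation applies to all
# RDDs produced by parallelize() and by Python transformations on this context.
sc = SparkContext("local", "custom-serde-demo", serializer=MarshalSerDe())

rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * 2).take(5))  # elements round-trip through marshal instead of pickle
```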
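The batching change in the last bullet can be pictured as a decorator around an unbatched SerDe. This is an illustrative sketch only, not the actual PySpark implementation; the class name and the `dump_stream` / `load_stream` interface are assumptions.

```python
class BatchedSerDe:
    """Sketch of batching implemented by wrapping an unbatched SerDe."""

    def __init__(self, serde, batch_size=1024):
        self.serde = serde            # the unbatched serializer being wrapped
        self.batch_size = batch_size

    def dump_stream(self, iterator, stream):
        # Group items into lists and let the wrapped SerDe write each list
        # as a single record.
        batch = []
        for item in iterator:
            batch.append(item)
            if len(batch) == self.batch_size:
                self.serde.dump_stream([batch], stream)
                batch = []
        if batch:
            self.serde.dump_stream([batch], stream)

    def load_stream(self, stream):
        # Read batches written by the wrapped SerDe and transparently unbatch.
        for batch in self.serde.load_stream(stream):
            for item in batch:
                yield item
```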
Showing 9 changed files with 428 additions and 246 deletions:
- core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala: 45 additions, 104 deletions
- python/epydoc.conf: 1 addition, 1 deletion
- python/pyspark/accumulators.py: 4 additions, 2 deletions
- python/pyspark/context.py: 50 additions, 21 deletions
- python/pyspark/rdd.py: 54 additions, 43 deletions
- python/pyspark/serializers.py: 252 additions, 49 deletions
- python/pyspark/tests.py: 2 additions, 1 deletion
- python/pyspark/worker.py: 19 additions, 25 deletions
- python/run-tests: 1 addition, 0 deletions