When you are working with Python, you have two different options for development: Either use Pythons REPL (Read-Execute-Print-Loop) interface for interactive development. These teams also build complex data processing chains with PySpark. I also know of software development teams which already know Python very well and try to avoid to learn Scala in order to focus on data processing and not on learning a new language. The integration of Python with Spark allows me to mix Spark code to process huge amounts of data with other powerful Python frameworks like Numpy, Pandas and of course Matplotlib. Fortunately Spark comes out-of-the-box with high quality interfaces to both Python and R, so you can use Spark to process really huge datasets inside the programming language of your choice (that is, if Python or R is the programming language of your choice, of course). But on the other hand, people who are deeply into data analytics often feel more comfortable with simpler languages like Python or dedicated statistical languages like R. Initially only Scala and Java bindings were available for Spark, since it is implemented in Scala itself and runs on the JVM. Currently Apache Spark with its bindings PySpark and SparkR is the processing tool of choice in the Hadoop Environment.
0 Comments
Leave a Reply. |
Details
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |