DFRW 1 - CSV (Python)

Reading Data - CSV Files

Technical Accomplishments:

  • Start working with the API documentation
  • Introduce the class SparkSession and other entry points
  • Introduce the class DataFrameReader
  • Read data from (a preview sketch follows this list):
    • CSV without a schema
    • CSV with a schema

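As a preview of where the lesson is headed, here is a minimal sketch of both reads. The file path, column names, and types below are placeholders for illustration, not the course dataset:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

csvPath = "/mnt/training/example.csv"  # placeholder path

# Without a schema: Spark infers the column types by scanning the file first.
inferredDF = (spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath))

# With a user-defined schema: no inference pass is needed, so the read is faster.
csvSchema = StructType([
  StructField("id", IntegerType(), True),
  StructField("name", StringType(), True)
])
explicitDF = (spark.read
  .option("header", "true")
  .schema(csvSchema)
  .csv(csvPath))
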
Classroom-Setup

For each lesson to execute correctly, please make sure to run the Classroom-Setup cell at the start of the lesson (see the next cell) and the Classroom-Cleanup cell at the end.

%run "../Includes/Classroom-Setup"

Entry Points

Our entry point for Spark 2.0 applications is the class SparkSession.

An instance of this class has already been instantiated for us, which we can easily demonstrate by running the next cell:

print(spark)
<pyspark.sql.session.SparkSession object at 0x7f475b897610>
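
In a Databricks notebook this session is created for us. In a standalone PySpark application you would typically build one yourself; a minimal sketch (the application name is arbitrary):

from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one is already running,
# which is why re-running this in a notebook is harmless.
spark = SparkSession.builder.appName("csv-lesson").getOrCreate()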

It's worth noting that in Spark 2.0, SparkSession is a replacement for the older entry points:

  • SparkContext, available in our notebook as sc.
  • SQLContext, or more specifically its subclass HiveContext, available in our notebook as sqlContext.

print(sc)
print(sqlContext)
<SparkContext master=local[8] appName=Databricks Shell>
<pyspark.sql.context.SQLContext object at 0x7f475b897590>
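
Both of these older entry points can also be reached through the SparkSession itself, so new code rarely needs to hold onto sc or sqlContext directly. A short sketch, assuming the pre-created spark session:

# The SparkContext is exposed as an attribute of the session...
print(spark.sparkContext)

# ...and the SQL functionality of SQLContext/HiveContext is available on the
# session itself, e.g. running a query against a temporary view.
spark.range(3).createOrReplaceTempView("tinyTable")
print(spark.sql("SELECT count(*) AS total FROM tinyTable").collect())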