PySpark Read CSV file : In this tutorial, I will explain how to create a Spark DataFrame from a CSV file.

Introduction

CSV is a widely used format for data processing. The read.csv() function in PySpark reads a CSV file and loads it into a PySpark DataFrame. In this tutorial we will therefore see how to read one or more CSV files from a local directory and apply the different transformations made possible by the function's options.

If you need to install Spark on your machine, you can consult the beginning of this tutorial :

PySpark read.csv Syntax

To illustrate the different examples, we will use this file, which contains a list of the different Pokémon. You can download it via this link :

This file contains 13 columns, which are as follows :

  1. Index
  2. Name
  3. Type1
  4. Type2
  5. Total
  6. HP
  7. Attack
  8. Defense
  9. SpecialAtk
  10. SpecialDef
  11. Speed
  12. Generation
  13. Legendary

The basic syntax for using the read.csv function is as follows:

# The path or file is stored
spark.read.csv("path")

To read the CSV file as an example, proceed as follows:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType

spark = SparkSession.builder.appName('pyspark - example read csv').getOrCreate()
sc = spark.sparkContext

df = spark.read.csv("amiradata/pokedex.csv")
df.printSchema()
df.show(5, False)
# Result of the printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)

# Result of the show() function
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|_c0  |_c1                  |_c2  |_c3   |_c4  |_c5|_c6   |_c7    |_c8       |_c9       |_c10 |_c11      |_c12     |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|Index|Name                 |Type1|Type2 |Total|HP |Attack|Defense|SpecialAtk|SpecialDef|Speed|Generation|Legendary|
|1    |Bulbasaur            |Grass|Poison|318  |45 |49    |49     |65        |65        |45   |1         |False    |
|2    |Ivysaur              |Grass|Poison|405  |60 |62    |63     |80        |80        |60   |1         |False    |
|3    |Venusaur             |Grass|Poison|525  |80 |82    |83     |100       |100       |80   |1         |False    |
|3    |VenusaurMega Venusaur|Grass|Poison|625  |80 |100   |123    |122       |120       |80   |1         |False    |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
only showing top 5 rows

By default, when only the path of the file is specified, header is set to False even though the file contains a header on its first line. All columns are also treated as strings. To solve these problems the read.csv() function takes several optional arguments, the most common of which are :

  • header : uses the first line as the names of the columns. By default, the value is False
  • sep : sets a separator for each field and value. By default, the value is a comma
  • schema : an optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string
  • path : string, or list of strings, for input path(s), or an RDD of strings storing CSV rows

You will find the complete list of parameters on the official Spark website.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv#pyspark.sql.DataFrameReader

If your file already contains a header on the first line, you must specify it explicitly by setting the header parameter to True.

# Specifies the header to True
df = spark.read.csv("amiradata/pokedex.csv", header=True)
df.printSchema()
root
 |-- Index: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- HP: string (nullable = true)
 |-- Attack: string (nullable = true)
 |-- Defense: string (nullable = true)
 |-- SpecialAtk: string (nullable = true)
 |-- SpecialDef: string (nullable = true)
 |-- Speed: string (nullable = true)
 |-- Generation: string (nullable = true)
 |-- Legendary: string (nullable = true)

With printSchema(), we can see that the header has been taken into account.

Read a CSV file using a custom schema

As we have seen, by default all columns are treated as strings. If we want to change this, we can define a schema using StructType. Once our structure is created, we can pass it to the schema parameter of the read.csv() function.

# Schema of the table
schema = StructType() \
      .add("Index", IntegerType(), True) \
      .add("Name", StringType(), True) \
      .add("Type1", StringType(), True) \
      .add("Type2", StringType(), True) \
      .add("Total", IntegerType(), True) \
      .add("HP", IntegerType(), True) \
      .add("Attack", IntegerType(), True) \
      .add("Defense", IntegerType(), True) \
      .add("SpecialAtk", IntegerType(), True) \
      .add("SpecialDef", IntegerType(), True) \
      .add("Speed", IntegerType(), True) \
      .add("Generation", IntegerType(), True) \
      .add("Legendary", BooleanType(), True)

df = spark.read.csv("amiradata/pokedex.csv", header=True, schema=schema)
df.printSchema()
root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: integer (nullable = true)
 |-- HP: integer (nullable = true)
 |-- Attack: integer (nullable = true)
 |-- Defense: integer (nullable = true)
 |-- SpecialAtk: integer (nullable = true)
 |-- SpecialDef: integer (nullable = true)
 |-- Speed: integer (nullable = true)
 |-- Generation: integer (nullable = true)
 |-- Legendary: boolean (nullable = true)

As you can see, the schema has been changed and contains the types we specified in our structure.

Read multiple CSV files

With this function it is also possible to read several files at once (either by listing the paths of each file or by specifying the folder where your different files are located):

# Reads the 3 files specified in the path parameter
df = spark.read.csv(["amiradata/pokedex.csv", "amiradata/pokedex2.csv", "amiradata/pokedex3.csv"])

# Reads all the files in the folder
df = spark.read.csv("amiradata/")

Conclusion

In this tutorial we have learned how to read a CSV file using the read.csv() function in Spark. This function is very useful and we have only seen a tiny part of the options it offers us.

If you want to learn more about PySpark, you can read this book : ( As an Amazon Partner, I earn from qualifying purchases ) :

In our next article, we will see how to create a CSV file from a PySpark DataFrame. Stay tuned! 🙂

Back to the python section

I'm a data scientist. Passionate about new technologies and programming, I created this website mainly for people who want to learn more about data science and programming :)
