How to Read Multiple CSV Files Into a DataFrame in Python
PySpark Read CSV file: In this tutorial, I will explain how to create a Spark dataframe using a CSV file.
Introduction
CSV is a widely used data format for processing data. The read.csv() function present in PySpark allows you to read a CSV file and save this file in a PySpark dataframe. We will therefore see in this tutorial how to read one or more CSV files from a local directory and apply the different transformations possible with the options of the function.
If you need to install Spark on your machine, you can consult the beginning of this tutorial:
PySpark read.csv Syntax
To illustrate the different examples, we will use this file, which contains the list of the different Pokémon. You can download it via this link:
This file contains 13 columns, which are as follows:
- Index
- Name
- Type1
- Type2
- Total
- HP
- Attack
- Defense
- SpecialAtk
- SpecialDef
- Speed
- Generation
- Legendary
The basic syntax for using the read.csv function is as follows:
# "path" is where the file is stored
spark.read.csv("path")
To read the CSV file as an example, proceed as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType

spark = SparkSession.builder.appName('pyspark - example read csv').getOrCreate()
sc = spark.sparkContext

df = spark.read.csv("amiradata/pokedex.csv")
df.printSchema()
df.show(5, False)
# Result of the printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)

# Result of the show() function
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|_c0  |_c1                  |_c2  |_c3   |_c4  |_c5|_c6   |_c7    |_c8       |_c9       |_c10 |_c11      |_c12     |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
|Index|Name                 |Type1|Type2 |Total|HP |Attack|Defense|SpecialAtk|SpecialDef|Speed|Generation|Legendary|
|1    |Bulbasaur            |Grass|Poison|318  |45 |49    |49     |65        |65        |45   |1         |False    |
|2    |Ivysaur              |Grass|Poison|405  |60 |62    |63     |80        |80        |60   |1         |False    |
|3    |Venusaur             |Grass|Poison|525  |80 |82    |83     |100       |100       |80   |1         |False    |
|3    |VenusaurMega Venusaur|Grass|Poison|625  |80 |100   |123    |122       |120       |80   |1         |False    |
+-----+---------------------+-----+------+-----+---+------+-------+----------+----------+-----+----------+---------+
only showing top 5 rows
By default, when only the path of the file is specified, the header is set to False even though the file contains a header on the first line. All columns are also considered as strings. To solve these problems, the read.csv() function takes several optional arguments, the most common of which are as follows (a short example follows the list):
- header : uses the first line as the names of the columns. By default, the value is False
- sep : sets a separator for each field and value. By default, the value is a comma
- schema : an optional pyspark.sql.types.StructType for the input schema, or a DDL-formatted string
- path : string, or list of strings, for input path(s), or RDD of Strings storing CSV rows
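As an illustration of these options, here is a minimal sketch that reads a hypothetical semicolon-separated copy of the file (the file name pokedex_semicolon.csv is an assumption for illustration only). It also shows inferSchema, another common argument not listed above, which asks Spark to guess column types instead of treating everything as strings:

# Minimal sketch: "amiradata/pokedex_semicolon.csv" is a hypothetical
# semicolon-separated version of the file, used only for illustration
df_sep = spark.read.csv(
    "amiradata/pokedex_semicolon.csv",
    header=True,       # use the first line as column names
    sep=";",           # fields are separated by semicolons instead of commas
    inferSchema=True   # let Spark infer column types instead of all strings
)
df_sep.printSchema()

Note that inferSchema requires an extra pass over the data, so supplying a schema explicitly (as shown later in this tutorial) is faster on large files.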
You will find the complete list of parameters on the official Spark website.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv#pyspark.sql.DataFrameReader
If your file already contains a header on the first line, you must specify it explicitly by setting the header parameter to True.
# Set the header to True
df = spark.read.csv("amiradata/pokedex.csv", header=True)
df.printSchema()
root
 |-- Index: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- HP: string (nullable = true)
 |-- Attack: string (nullable = true)
 |-- Defense: string (nullable = true)
 |-- SpecialAtk: string (nullable = true)
 |-- SpecialDef: string (nullable = true)
 |-- Speed: string (nullable = true)
 |-- Generation: string (nullable = true)
 |-- Legendary: string (nullable = true)
With printSchema(), we can see that the header has been taken into account.
Read CSV file using a custom schema
As we have seen, by default all columns are considered as strings. If we want to change this, we can use a StructType structure. Once our structure is created, we can specify it in the schema parameter of the read.csv() function.
# Schema of the table
schema = StructType() \
    .add("Index", IntegerType(), True) \
    .add("Name", StringType(), True) \
    .add("Type1", StringType(), True) \
    .add("Type2", StringType(), True) \
    .add("Total", IntegerType(), True) \
    .add("HP", IntegerType(), True) \
    .add("Attack", IntegerType(), True) \
    .add("Defense", IntegerType(), True) \
    .add("SpecialAtk", IntegerType(), True) \
    .add("SpecialDef", IntegerType(), True) \
    .add("Speed", IntegerType(), True) \
    .add("Generation", IntegerType(), True) \
    .add("Legendary", BooleanType(), True)

df = spark.read.csv("amiradata/pokedex.csv", header=True, schema=schema)
df.printSchema()
root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type1: string (nullable = true)
 |-- Type2: string (nullable = true)
 |-- Total: integer (nullable = true)
 |-- HP: integer (nullable = true)
 |-- Attack: integer (nullable = true)
 |-- Defense: integer (nullable = true)
 |-- SpecialAtk: integer (nullable = true)
 |-- SpecialDef: integer (nullable = true)
 |-- Speed: integer (nullable = true)
 |-- Generation: integer (nullable = true)
 |-- Legendary: boolean (nullable = true)
As you can see, the schema has been changed and now contains the types we specified in our structure.
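As mentioned in the list of options above, the schema parameter also accepts a DDL-formatted string instead of a StructType (in recent Spark versions). Here is a minimal sketch of the same schema written that way:

# Same schema as above, expressed as a DDL-formatted string
ddl_schema = "Index INT, Name STRING, Type1 STRING, Type2 STRING, Total INT, " \
             "HP INT, Attack INT, Defense INT, SpecialAtk INT, SpecialDef INT, " \
             "Speed INT, Generation INT, Legendary BOOLEAN"
df = spark.read.csv("amiradata/pokedex.csv", header=True, schema=ddl_schema)
df.printSchema()

This is more compact than building a StructType, at the cost of less fine-grained control (for instance over nullability).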
Read multiple CSV files
With this function it is possible to read several files directly, either by listing all the paths of the files or by specifying the folder where your different files are located:
# Reads the 3 files specified in the path parameter
df = spark.read.csv(["amiradata/pokedex.csv", "amiradata/pokedex2.csv", "amiradata/pokedex3.csv"])

# Reads all the files in the folder
df = spark.read.csv("amiradata/")
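Spark paths also accept glob patterns, so a wildcard can pick up all three files without listing each one. A minimal sketch, assuming the files share the "pokedex" prefix as above:

# Reads every file matching the pattern (pokedex.csv, pokedex2.csv, pokedex3.csv)
df = spark.read.csv("amiradata/pokedex*.csv", header=True, schema=schema)
print(df.count())  # total number of rows across all matched files

When reading several files at once, passing an explicit schema (or header=True) ensures all files are interpreted consistently.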
Conclusion
In this tutorial we have learned how to read a CSV file using the read.csv() function in Spark. This function is very useful and we have only seen a tiny part of the options it offers us.
If you want to learn more about PySpark, you can read this book (as an Amazon Partner, I earn from qualifying purchases):
In our next article, we will see how to create a CSV file from a PySpark DataFrame. Stay tuned! 🙂
Source: https://amiradata.com/pyspark-read-csv-file-into-pyspark-dataframe/