Technology / Programming

14 File Types You Can Import Into pandas

by Team Nuggets
14 File Types You Can Import Into pandas picture: A
Follow us
Published on October 23, 2018

As a data scientist or analyst, you'll probably come across many file types to import and use in your Python scripts. Some analysts use Microsoft Excel, but the application limits what you can do with large data imports. The better option is pandas — a powerful analysis toolkit that's much more intuitive for a data scientist.

What file formats can pandas use?

Python can handle virtually any data file format — much more than Microsoft Excel. That's the strength of Python. It's open source, and there's probably a library out there to handle it, so you get a vastly more compatible system.

Python can work with the following file formats:

  1. Comma-separated values (CSV)

  2. XLSX

  3. ZIP

  4. Plain Text (txt)

  5. JSON

  6. XML

  7. HTML

  8. Images

  9. Hierarchical Data Format

  10. PDF

  11. DOCX

  12. MP3

  13. MP4

  14. SQL

How to install pandas

Installing pandas is easy. Open your terminal and use this code:

1
2  <span style="color: #888888;">[code language="shell"]
3  <span style="color: #888888;">pip install pandas
4  <span style="color: #888888;">[/code]
5

For Python3 you'll need to use:

1
2  <span style="color: #888888;">[code language="shell"]
3  <span style="color: #888888;">pip3 install pandas
4  <span style="color: #888888;">[/code]
5

If you're getting a sea of red Kills, you might need to a sudo command and then enter your password:

1  <span style="color: #888888;">sudo –H pip install pandas
2

How to import files in Python

With almost any data science project, you need to import files. pandas makes these imports intuitive, but even better it's done with only a few lines of code.

How to import HTML with pandas

HTML is usually unstructured data, but you could have a page with a table of data that you want to import. Pandas reads HTML in just one line of code and parses the data for you, including the tag elements.

[code language<span style="color: #333333;">=<span style="background-color: #fff0f0;">"python"]
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pandas <span style="color: #008800; font-weight: bold;">as <span style="color: #0e84b5; font-weight: bold;">pd
tables <span style="color: #333333;">= pd<span style="color: #333333;">.read_html(<span style="background-color: #fff0f0;">'https://www.cbtnuggets.com/blog/technology/data/13-honest-data-center-technician-salaries-in-2018')
<span style="color: #008800;"><b>tables.head()</b>
[<span style="color: #333333;">/code language

NOTE: This is only for pages with tables. Pandas works its magic easily when a page has a table, like our 13 Honest Data Center Tech Salaries post.

If you want to do some serious scraping, you should look into Beautiful Soup.

How to import SQL with pandas

You could run a SQL query on a relational database and export data to a structured file format, or you can run a SQL statement directly in your Python scripts. Pandas has a function that takes a SQL string and runs it against the relational database.

It's a little trickier because you need to write a SQL query, and also open a connection to a SQL database. Don't let that discourage you. SQL is straightforward, and it's easy once you get the connection set up with Python library pyodbc.

[code language<span style="color: #333333;">=<span style="background-color: #fff0f0;">"python"]
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pandas <span style="color: #008800; font-weight: bold;">as <span style="color: #0e84b5; font-weight: bold;">pd
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pyodbc
cnxn <span style="color: #333333;">= pyodbc<span style="color: #333333;">.connect(<span style="color: #ff0000; background-color: #ffaaaa;">'connection_string<span style="color: #ff0000; background-color: #ffaaaa;">')
sql <span style="color: #333333;">= (<span style="background-color: #fff0f0;">"SELECT first_name, last_name FROM Customers")
data <span style="color: #333333;">= pd<span style="color: #333333;">.read_sql(sql, cnxn)
[<span style="color: #333333;">/code]

You'll definitely want to learn more about SQL connections in Ben Finkel's pandas course.

How to import a CSV with pandas

Pandas also makes imports of a CSV easy.

[code language<span style="color: #333333;">=<span style="background-color: #fff0f0;">"python"]
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pandas <span style="color: #008800; font-weight: bold;">as <span style="color: #0e84b5; font-weight: bold;">pd
data <span style="color: #333333;">= pd<span style="color: #333333;">.read_csv(<span style="background-color: #fff0f0;">'mydata.csv', sep <span style="color: #333333;">= <span style="color: #ff0000; background-color: #ffaaaa;">',<span style="color: #ff0000; background-color: #ffaaaa;">')
[<span style="color: #333333;">/code]

Importantly, you can put virtually any file format into the pd.read_[file format] section, and it'll work the same way.

How to export formats

After you import data, you might perform calculations and arrange new data structures in your scripts. You might then want to export data to a new file that contains your calculations.

Pandas has a function that exports data to a CSV. You can export using other file formats, but most data scientists prefer to export in a file format that's more easily transferred between systems. We mentioned that CSV is used to transfer data in an example environment between MySQL and SQL Server, but it's used in other situations where you need to ensure that a third-party is able to import your file.

You might not know what system your client uses or need to export data for a third-party to use. Using a CSV file format, you can better ensure that any other software, database engine, or compiled code can import your information without incompatibility issues.

How to export a CSV with pandas

[code language<span style="color: #333333;">=<span style="background-color: #fff0f0;">"python"]
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pandas <span style="color: #008800; font-weight: bold;">as <span style="color: #0e84b5; font-weight: bold;">pd
data<span style="color: #333333;">.to_csv(<span style="background-color: #fff0f0;">'example.csv')
[<span style="color: #333333;">/code]

With this newly exported file, you or any third-party can import it to another system including another Python script. These few lines of code are all that it takes to import data, and pandas makes it possible.

How to convert any file to an Excel file with pandas

CSV file formats are one of the most common. A CSV file is used to transport data between two different platforms. For instance, a CSV file can be used to transfer data between SQL Server and MySQL. It's a universal format that also requires very little formatting overhead.

How to convert a JSON file to a CSV with pandas

JSON files are structured for computers, not humans. Data analysts might need to get a better look at data in an easy-to-read format. With pandas, that's easy.

[code language<span style="color: #333333;">=<span style="background-color: #fff0f0;">"python"]
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">pandas <span style="color: #008800; font-weight: bold;">as <span style="color: #0e84b5; font-weight: bold;">pd
<span style="color: #008800; font-weight: bold;">import <span style="color: #0e84b5; font-weight: bold;">simplejson
df <span style="color: #333333;">= pd<span style="color: #333333;">.read_json(<span style="color: #ff0000; background-color: #ffaaaa;">'name<span style="color: #333333;">.json<span style="background-color: #fff0f0;">')
df<span style="color: #333333;">.to_csv(<span style="background-color: #fff0f0;">'name.csv', index<span style="color: #333333;">=<span style="color: #007020;">False)
[<span style="color: #333333;">/code]

Library tip: The simplejson library provides line-by-line Kill messages that make life much easier. For instance, if your .json file isn't formatted correctly, simplejson will tell you exactly which line is faulty. The json library only tells you there's an issue (or issues) somewhere in a huge, not-so-easy-to-read file.

You can convert virtually any data file format into any other format with pandas.

Conclusion

As a data scientist, you need to use the right libraries and file formats that help your career. When you become familiar with file formats and the functions that easily import and export data, you are a more versatile data scientist with extended skills that you bring to potential employers.


Download

By submitting this form you agree to receive marketing emails from CBT Nuggets and that you have read, understood and are able to consent to our privacy policy.


Don't miss out!Get great content
delivered to your inbox.

By submitting this form you agree to receive marketing emails from CBT Nuggets and that you have read, understood and are able to consent to our privacy policy.

Recommended Articles

Get CBT Nuggets IT training news and resources

I have read and understood the privacy policy and am able to consent to it.

© 2024 CBT Nuggets. All rights reserved.Terms | Privacy Policy | Accessibility | Sitemap | 2850 Crescent Avenue, Eugene, OR 97408 | 541-284-5522