Imagine 10 dots in your mind.
Imagine 100 dots in your mind.
Now imagine 2,500,000,000,000 dots, or bytes for that matter. This is simply inconceivable for the human mind. So, what do humans do when they cannot understand something? They analyze it, and thus, Data Science is born.
The field of Data Science is expanding. Roughly 90% of the world's data was created in the last two years, and 2.5 million terabytes of new data are created daily. Information is becoming its own sector of the world economy, with this digital collection of 1's and 0's quickly transforming into material wealth. But, like all raw materials, raw data must be mined and processed before it gains any real value. So, in this journal I will be documenting my journey in learning how to process datasets, and why the pandas software library has been such an impactful aid.
To begin, you will need access to Python. There are multiple IDEs to choose from: Sublime, Atom, PyCharm, etc. In this journal, I will be using Eclipse + PyDev. Once Python is installed, we will need to install pandas, which provides us with the DataFrame structure we will use to store, organize, and clean our dataset.
The Anaconda distribution, which is commonly used in the Data Science community, works as an easy way to install all the tools you will need going forward. But, for the moment, a simple pip install will suffice.
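(If you do choose the Anaconda route, pandas ships with the distribution by default; it can also be installed or updated through Anaconda's own package manager with a single command:)

conda install pandas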
We'll first check to see if we have pip installed:

pip --version

If pip is not already installed, we can try to bootstrap it from the standard library:

python -m ensurepip --default-pip
Modern versions of Python (3.4+) come with pip pre-installed, which expedites the process, so if the two methods above do not work, you will likely need to update your Python installation.
To find your Python version, run the command below:

python --version

If your version is older than Python 3.4, you can download the most recent release here.
Once your software is up to date and pip is installed, we are free to install pandas and confirm which version we have. First, from the command line:

pip install pandas

Then, inside Python:

import pandas as pd
print(pd.__version__)
Although I will be introducing these datasets in the Problem at Hand section, if you are following along with this article and would like an early peek at my chosen datasets, you may download each linked file:
Arabica Dataset,
Robusta Dataset.
Now, if everything is up to date and running properly, we can begin.
First, let's identify the issue at hand, which will serve as our running example as we move through the pandas software:
"I wake up groggy after a late night of homework. I need a coffee, but I am out of beans. So, I grab my keys, and the next thing I know I'm walking into a Target, Whole Foods, Publix, etc., staring down fifty options. I want to know which coffee is the best. Everybody likes coffee, and nobody likes bad coffee. So, how do I determine the best coffee? Is coffee from Ethiopia better than coffee from Guatemala? Does altitude matter?"
I needed to either develop or discover a dataset applicable to my question, and for the purposes of this journal (and time efficiency), I found one ready-made. Looking on GitHub, I found two datasets that cover over 1,300 different coffees: the Arabica and Robusta datasets. But before getting into any coding, an understanding of the question we are asking and the datasets we are using is absolutely necessary.
At first glance a logical question may be, "What is the single best coffee?", but this question would not let us fully utilize the pandas software; you can simply google which coffee is rated highest. So where does our interest in these datasets lie? Rather than a specific coffee, we will be looking at correlations between location, altitude, ownership, etc. and the coffee's rating. So the question we are looking to answer is:
"Depending only on its characteristics, how do we identify a "good" coffee, and which characteristics serve as the most important factors in quality?"
Both datasets are based on the same grading rubric, and as such have the same columns. Looking at the raw data we can see characteristics such as altitude, country of origin, flavor, aroma, etc. Each of these variables will play an important part in answering the question above.
On a more technical level, each dataset is in CSV format and has 44 columns. This mass of information will serve us as we move forward in answering our question, but some data is more important than the rest. One of our first steps will be to clean our datasets further, removing any extraneous data.
Many issues can arise when using other software to process data. Excel, the tool most commonly taught for this kind of work, runs into several problems as one explores the field of Data Science. These are outlined below.
Datasets become very large in the field of Data Science. Programs such as Microsoft Excel can begin to slow down and stutter at around 10,000 rows of information, and while 10,000 may seem like a lot, it is just a drop in the bucket next to larger datasets. Pandas can run expansive datasets, with the only real limitation being the hardware it runs on. Using the chunksize parameter of its readers, we can even let slower machines work through massive files piece by piece. This lets us make the most of our computer's memory and again separates pandas from the likes of Microsoft Excel.
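As a quick illustration, here is a minimal sketch of chunked reading, reusing our arabica file (the chunk size of 1,000 rows is an arbitrary choice of mine):

import pandas as pd

# Read the CSV 1,000 rows at a time instead of loading it all at once
for chunk in pd.read_csv('arabica_data.csv', chunksize=1000):
    # Each chunk is an ordinary DataFrame, so the usual operations apply
    print(chunk.shape)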
While Excel requires us to take time converting our file types before importing, pandas supports the import and export of over 15 file types, including CSV, JSON, Excel, HTML, SQL, and Parquet. In any instance where the data you have is not in the file type you want, pandas fares much better than its competitors.
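For example, moving our arabica data between formats takes a single call per format. A sketch (the output filenames here are my own, and to_excel additionally requires the openpyxl package):

import pandas as pd

df = pd.read_csv('arabica_data.csv')
df.to_json('arabica_data.json')     # the same data written out as JSON
df.to_html('arabica_data.html')     # or as an HTML table
# df.to_excel('arabica_data.xlsx')  # or as a spreadsheet (requires openpyxl)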
Compared again to Excel, pandas has much more powerful data-cleaning machinery built in. Methods such as fillna(), dropna(), interpolate(), and drop_duplicates() can repair data, fix holes, and remove duplicates, allowing us the privilege of not hunting down issues across thousands of data points by hand. While it may seem pretty simple to fix a few holes and patch a few leaks, when your dataset runs hundreds of columns by millions of rows this becomes "brain-bleedingly" difficult.
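A hedged sketch of what that looks like in practice (the column names come from the coffee data we will combine into comb later, and filling with the column mean is just one possible strategy):

# Remove exact duplicate rows
comb = comb.drop_duplicates()
# Patch holes in a numeric column using that column's mean
comb['Aroma'] = comb['Aroma'].fillna(comb['Aroma'].mean())
# Drop any rows still missing the overall score
comb = comb.dropna(subset=['Total.Cup.Points'])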
In this section I will walk through my general thought process while trying to accomplish my goal: finding a damn good cup of coffee. I am posting my code as well as an abbreviated view of the output, and I am adding a print() call at the end of each segment to help with visualization. This output will appear in the grey box beneath our input.

We have downloaded our software, identified our problem, and gained a solid understanding of the question we are asking. So, it is time to begin the fun part.
The first step when using pandas is to import your data. For the purposes of this journal I will demonstrate importing CSV files; other file types work in exactly the same way. So, let's begin by importing our datasets:
arabica_data = pd.read_csv('arabica_data.csv')
robusta_data = pd.read_csv('robusta_data.csv')
print(arabica_data)
print(robusta_data)
As seen above, we are actually working with two separate datasets. While we could run the same commands on each one, it is much easier to combine them. There are multiple ways to combine datasets, but because the two share identical columns, we can simply concatenate them into a new dataset, comb:
comb = pd.concat([arabica_data, robusta_data])
print(comb)
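One design note: concat keeps each dataset's original row labels, which is why we will renumber them later on. If you would rather renumber immediately, concat can do it in one step (a sketch, equivalent to the reset_index we perform further down):

comb = pd.concat([arabica_data, robusta_data], ignore_index=True)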
While the next few steps aren't absolutely necessary, they helped me move through this project a bit quicker. First, I checked what all my columns were named using the columns attribute. This let me see the different variables that were included and identify the most important ones:
print(comb.columns)
Now we have a list of every variable. After doing this, I wanted to see how our data was laid out: were the ratings out of 100, out of 10, etc.? The iloc indexer prints a single row/coffee along with its data. For example, I wanted to see all the properties of the row at position four (the fifth row, since positions start at zero):
print(comb.iloc[4])
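iloc also accepts slices, so a quick way to eyeball several coffees at once is (a sketch):

print(comb.iloc[0:3])   # the first three rows of the combined dataset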
Next, I wanted summary statistics for all of my numeric data. The describe() method reports the count, mean, standard deviation, and quartiles of every numeric column, and from these I found a "cut-off level" for removing some of the more extraneous data. While not always necessary, I chose to do this because we are only looking for information on the best coffees:
print(comb.describe())
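describe() can also be pointed at a single column when the full table is too much to take in (a sketch):

print(comb['Total.Cup.Points'].describe())   # summary statistics for the overall score alone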
As I stated earlier, printing my column names made me realize there were two columns on defects, so I chose to combine them into one, creating our 45th column. This process can use data from existing columns to create new ones:
comb['Total.Defects'] = comb['Category.One.Defects'] + comb['Category.Two.Defects']
print(comb['Total.Defects'])
After creating this new column, I decided it was time to cut down from our 45 columns, many of which were either repetitive or lacking information. Above we learned how to add new columns; below we will learn the inverse. I printed the remaining columns to make sure I hadn't accidentally deleted any useful information:
comb = comb.drop(columns=['Farm.Name',
'Lot.Number', 'Mill', 'ICO.Number', 'Altitude',
'Region', 'Producer', 'Number.of.Bags', 'Bag.Weight',
'Harvest.Year', 'Owner', 'Variety',
'Processing.Method', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects',
'Expiration', 'Certification.Body',
'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters',
'altitude_high_meters', 'Moisture',
'In.Country.Partner', 'Owner.1', 'Fragrance...Aroma',
'Salt...Acid', 'Bitter...Sweet', 'Mouthfeel',
'Uniform.Cup'])
print(comb.columns)
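An equivalent approach, which can be less error-prone with this many columns, is to list the columns you want to keep instead of the ones to drop. A sketch (the keep list below is illustrative only; if you use it, be sure to keep every column referenced later, including the 'Unnamed: 0' index column used in the counting steps):

keep = ['Country.of.Origin', 'Aroma', 'Flavor', 'Aftertaste',
        'Total.Cup.Points', 'Total.Defects', 'altitude_mean_meters']
comb = comb[keep]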
Now we have practiced a few pandas commands and turned two chunky datasets into a streamlined version of their former selves. Practicing the addition and removal of columns, alongside learning how to inspect specific rows, will come in handy in our next section.
In this section I focus on answering the first half of our question:
"Depending only on its characteristics, how do we identify a "good" coffee, and which characteristics serve as the most important factors in quality?"

Using my personal opinion of what a good coffee looks like, I decided which characteristics mattered most to me. The first I considered was 'Total.Cup.Points', which is the sum of all the graded values. I figured that if the sum was eighty-three points or fewer, I didn't want it included in my "good" coffee dataset:
Also, pay close attention to the number of rows we have as we clean our dataset.
comb = comb.loc[(comb['Total.Cup.Points'] > 83)]
print(comb)
Nobody wants to buy a bag of coffee with a lot of defects, so this became my next factor, with fewer than five total defects per bag being necessary. Here I was able to use the column I created earlier:
comb = comb.loc[(comb['Total.Defects'] < 5)]
print(comb)
Going one factor at a time was becoming inefficient, so I ran through multiple factors at once, adjusting the "cut-off level" until I found a balance between having enough brands to give a good overview, but not so many that it included "bad" coffee (I ended up deciding that 83 was a bit too low):
comb = comb.loc[((comb['Aroma'] > 7.75) &
                 (comb['Flavor'] > 7.75) &
                 (comb['Aftertaste'] > 7.75)) |
                (comb['Total.Cup.Points'] > 85)]
print(comb)
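A side note on style: the same filter can be written with the query() method, which some find easier to read; backticks are needed around column names containing dots (a sketch, requires pandas 0.25 or newer):

comb = comb.query('(Aroma > 7.75 and Flavor > 7.75 and Aftertaste > 7.75) '
                  'or `Total.Cup.Points` > 85')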
Upon getting down to 92 rows/brands, I decided I had found my equilibrium, so I moved past processing the data and into the final steps that would answer my question.
Now we begin answering the second half of our question:
"Depending only on its characteristics, how do we identify a "good" coffee, and which characteristics serve as the most important factors in quality?"
In this section we will analyze aspects of our new dataset, which I re-indexed and renamed 'new_comb'. For all of these calculations we will be using "groupby" operations. This cleaned dataset will finally let us decide which coffees give us the greatest chance of a good cup:
new_comb = comb.copy()   # copy() so that changes to new_comb leave comb untouched
new_comb.reset_index(drop=True, inplace=True)
print(new_comb)
As seen above, much of our data has been crunched together in the output as '…', which prevents us from seeing all of it. So, I will change the display options to show everything. I am only going to do this for one output example, to keep the article shorter; I just want to demonstrate the visual change below:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)
print(comb)
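If you later want the compact display back, pandas can restore the defaults (a sketch):

pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')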
My first question, since a coffee's country of origin is so often displayed, was: "Does 'Country.of.Origin' actually influence the taste?"
The code below prints the mean of every numeric column, organized by each coffee's "birthplace":
# numeric_only=True skips the text columns; newer pandas versions require this
print(new_comb.groupby(['Country.of.Origin']).mean(numeric_only=True))
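To rank the countries rather than scan them, the grouped means can be sorted (a sketch):

means = new_comb.groupby(['Country.of.Origin']).mean(numeric_only=True)
print(means.sort_values('Total.Cup.Points', ascending=False))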
This is where I became a bit distraught: all countries had very similar stats when it came to their "best" coffees, which logically makes sense. So, rather than using the means to find which country has the "best" coffee, I decided to see which country had the highest probability of giving me a good brew. I did this by counting how many of the 92 coffees came from each country:
country_count = new_comb.groupby(['Country.of.Origin']).count()
print(country_count['Unnamed: 0'])
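Counting through the leftover 'Unnamed: 0' index column works, but leaning on a leftover column is fragile; the size() method counts rows per group directly (a sketch):

print(new_comb.groupby(['Country.of.Origin']).size())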
Next, simply because I always hear the word, I chose to see whether "altitude" had any direct correlation with coffee quality, using both the mean and count methods to test for a relationship:
print(new_comb.groupby(['altitude_mean_meters']).mean(numeric_only=True))
altitude_count = new_comb.groupby(['altitude_mean_meters']).count()
print (altitude_count['Unnamed: 0'])
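One caveat I ran into: grouping by the exact altitude produces many groups containing a single coffee, which makes trends hard to see. Binning the altitudes first with pd.cut gives a clearer picture (a sketch; the bin edges are my own arbitrary choice):

# Bucket altitudes into broad bands before grouping
bands = pd.cut(new_comb['altitude_mean_meters'], bins=[0, 1000, 1500, 2000, 5000])
print(new_comb.groupby(bands)['Total.Cup.Points'].mean())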
Lastly, primarily for learning purposes, I wanted to test for a relation between 'Country.of.Origin' and 'Total.Cup.Points':
print(new_comb.groupby(['Country.of.Origin', 'Total.Cup.Points']).count()['Unnamed: 0'])
(Again, I am sorry for the length, but this will serve as very important data in our Conclusion section. <3)
With my experimenting, cleaning, and processing done, it was time to use my findings to draw some conclusions. Hopefully, armed with these, I can walk into any store knowing which coffees have the highest probability of pleasing my taste buds.
I. Outputs XIV and XVI showed me that while all countries had relatively similar numbers (since I had already removed the worst coffees), the count method revealed that far more brands from Ethiopia and the United States survived the cleaning than from anywhere else. With thirty-two brands between them, these two countries provided 35% of the best coffees. So, when given the choice, American and Ethiopian coffees are more likely to deliver a pleasurable experience.
II. Digging through Output XV, I realized that altitude is largely a buzzword meant to entice you into thinking a coffee is superior because it is grown higher up. Regardless of the column I looked at, nothing changed linearly as altitude increased; altitude remained a random variable in relation to taste and total point counts. The placebo effect and ingenious marketing are powerful forces, no?
III. Lastly, running through Output XVI, I noticed that while the U.S. had many coffees in the final 92, none of these scored above 88 total cup points. Ethiopia, by contrast, had 44% of its coffees in the final 92 scoring above an 88. Not only does Ethiopia hold a higher share of our "Top 92," it holds it with higher overall scores than its biggest contenders.
The number one characteristic for a good probability of getting a "good" coffee, at least relative to my tastes, is the country of origin. And, after running multiple tests and experimenting with my variables, I have decided that the country with the highest chance of giving me the finest coffee is Ethiopia.

Next time I walk into a Whole Foods, I will search for my next mug of beans with two major conclusions. One, altitude is pointless, and just serves as a way to upsell the same beans. Two, a coffee from America will probably be fine, but one from Ethiopia gives me the highest chance of buying the perfect grinds for myself.
Now, while we all enjoy coffee, we also all have our own opinions on what counts as a "good characteristic." Feel free to download pandas, run through this tutorial one more time, and enjoy Python's approach to hunting for your own perfect cup.