Data Visualization¶

This notebook focuses on visualizing survey and job data using Python data visualization libraries.

Data Visualization Lab¶

Estimated time needed: 45 to 60 minutes

In this assignment you will be focusing on the visualization of data.

The data set will be presented to you in the form of a RDBMS.

You will have to use SQL queries to extract the data.

Objectives¶

In this lab you will perform the following:

  • Visualize the distribution of data.

  • Visualize the relationship between two features.

  • Visualize composition of data.

  • Visualize comparison of data.


Demo: How to work with database¶

Download database file.

In [22]:
# !wget --no-check-certificate https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m4_survey_data.sqlite

Connect to the database.

In [23]:
import sqlite3
conn = sqlite3.connect("m4_survey_data.sqlite") # open a database connection

Import pandas module.

In [24]:
import pandas as pd
import seaborn as sns
import numpy as np

# use the inline backend to generate the plots within the browser
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0
Matplotlib version:  3.5.1

Demo: How to run an sql query¶

In [25]:
# print how many rows are there in the table named 'master'
QUERY = """
SELECT COUNT(*)
FROM master
"""

# the read_sql_query runs the sql query and returns the data as a dataframe
df = pd.read_sql_query(QUERY,conn)
df.head()
Out[25]:
COUNT(*)
0 11398

Demo: How to list all tables¶

In [26]:
# print all the tables names in the database
QUERY = """
SELECT name as Table_Name FROM
sqlite_master WHERE
type = 'table'
"""
# the read_sql_query runs the sql query and returns the data as a dataframe
pd.read_sql_query(QUERY,conn)
Out[26]:
Table_Name
0 EduOther
1 DevType
2 LastInt
3 JobFactors
4 WorkPlan
5 WorkChallenge
6 LanguageWorkedWith
7 LanguageDesireNextYear
8 DatabaseWorkedWith
9 DatabaseDesireNextYear
10 PlatformWorkedWith
11 PlatformDesireNextYear
12 WebFrameWorkedWith
13 WebFrameDesireNextYear
14 MiscTechWorkedWith
15 MiscTechDesireNextYear
16 DevEnviron
17 Containers
18 SOVisitTo
19 SONewContent
20 Gender
21 Sexuality
22 Ethnicity
23 master

Demo: How to run a group by query¶

In [27]:
QUERY = """
SELECT Age,COUNT(*) as count
FROM master
group by age
order by age
"""
pd.read_sql_query(QUERY,conn)
Out[27]:
Age count
0 NaN 287
1 16.0 3
2 17.0 6
3 18.0 29
4 19.0 78
5 20.0 109
6 21.0 203
7 22.0 406
8 23.0 581
9 24.0 679
10 25.0 738
11 26.0 720
12 27.0 724
13 28.0 787
14 29.0 697
15 30.0 651
16 31.0 531
17 32.0 489
18 33.0 483
19 34.0 395
20 35.0 393
21 36.0 308
22 37.0 280
23 38.0 279
24 39.0 232
25 40.0 187
26 41.0 136
27 42.0 162
28 43.0 100
29 44.0 95
30 45.0 85
31 46.0 66
32 47.0 68
33 48.0 64
34 49.0 66
35 50.0 57
36 51.0 29
37 52.0 41
38 53.0 32
39 54.0 26
40 55.0 13
41 56.0 16
42 57.0 11
43 58.0 12
44 59.0 11
45 60.0 2
46 61.0 10
47 62.0 5
48 63.0 7
49 65.0 2
50 66.0 1
51 67.0 1
52 69.0 1
53 71.0 2
54 72.0 1
55 99.0 1

Demo: How to describe a table¶

In [28]:
table_name = 'master'  # the table you wish to describe

QUERY = """
SELECT sql FROM sqlite_master
WHERE name= '{}'
""".format(table_name)

df = pd.read_sql_query(QUERY,conn)
print(df.iat[0,0])
CREATE TABLE "master" (
"index" INTEGER,
  "Respondent" INTEGER,
  "MainBranch" TEXT,
  "Hobbyist" TEXT,
  "OpenSourcer" TEXT,
  "OpenSource" TEXT,
  "Employment" TEXT,
  "Country" TEXT,
  "Student" TEXT,
  "EdLevel" TEXT,
  "UndergradMajor" TEXT,
  "OrgSize" TEXT,
  "YearsCode" TEXT,
  "Age1stCode" TEXT,
  "YearsCodePro" TEXT,
  "CareerSat" TEXT,
  "JobSat" TEXT,
  "MgrIdiot" TEXT,
  "MgrMoney" TEXT,
  "MgrWant" TEXT,
  "JobSeek" TEXT,
  "LastHireDate" TEXT,
  "FizzBuzz" TEXT,
  "ResumeUpdate" TEXT,
  "CurrencySymbol" TEXT,
  "CurrencyDesc" TEXT,
  "CompTotal" REAL,
  "CompFreq" TEXT,
  "ConvertedComp" REAL,
  "WorkWeekHrs" REAL,
  "WorkRemote" TEXT,
  "WorkLoc" TEXT,
  "ImpSyn" TEXT,
  "CodeRev" TEXT,
  "CodeRevHrs" REAL,
  "UnitTests" TEXT,
  "PurchaseHow" TEXT,
  "PurchaseWhat" TEXT,
  "OpSys" TEXT,
  "BlockchainOrg" TEXT,
  "BlockchainIs" TEXT,
  "BetterLife" TEXT,
  "ITperson" TEXT,
  "OffOn" TEXT,
  "SocialMedia" TEXT,
  "Extraversion" TEXT,
  "ScreenName" TEXT,
  "SOVisit1st" TEXT,
  "SOVisitFreq" TEXT,
  "SOFindAnswer" TEXT,
  "SOTimeSaved" TEXT,
  "SOHowMuchTime" TEXT,
  "SOAccount" TEXT,
  "SOPartFreq" TEXT,
  "SOJobs" TEXT,
  "EntTeams" TEXT,
  "SOComm" TEXT,
  "WelcomeChange" TEXT,
  "Age" REAL,
  "Trans" TEXT,
  "Dependents" TEXT,
  "SurveyLength" TEXT,
  "SurveyEase" TEXT
)

Hands-on Lab¶

Visualizing distribution of data¶

Histograms¶

Plot a histogram of ConvertedComp.

In [29]:
# your code goes here

QUERY = """
SELECT ConvertedComp
FROM master
"""
df = pd.read_sql_query(QUERY,conn)
df["ConvertedComp"].plot.hist()
plt.title('Histogram of Salary converted to annual USD salaries using the exchange rate on 2019-02-01')
plt.ylabel('Number of Convertion')
plt.xlabel('Number of Salaries')

plt.show()
No description has been provided for this image

Box Plots¶

Plot a box plot of Age.

In [30]:
# your code goes here
QUERY = """
SELECT Age
FROM master
"""
df = pd.read_sql_query(QUERY,conn)

df["Age"].plot(kind='box', figsize=(10, 8))

plt.title('Box plot of Age and Number of People')
plt.ylabel('Number of People')

plt.show()
No description has been provided for this image

Visualizing relationships in data¶

Scatter Plots¶

Create a scatter plot of Age and WorkWeekHrs.

In [31]:
# your code goes here
QUERY = """
SELECT Age,WorkWeekHrs
FROM master
"""
df = pd.read_sql_query(QUERY,conn)

df.plot(kind='scatter', x='Age', y='WorkWeekHrs', figsize=(10, 6), color='darkblue')

plt.title('Work Week Hours of people by their age ')
plt.xlabel('Age')
plt.ylabel('Work Week Hours')

plt.show()
No description has been provided for this image

Bubble Plots¶

Create a bubble plot of WorkWeekHrs and CodeRevHrs, use Age column as bubble size.

In [32]:
# your code goes here
QUERY = """
SELECT Age,WorkWeekHrs,CodeRevHrs
FROM master
"""
df = pd.read_sql_query(QUERY,conn)
df.head()

norm_age = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())

ax = df.plot(kind='scatter',
                    x='WorkWeekHrs',
                    y='CodeRevHrs',
                    figsize=(14, 8),
                    alpha=0.5,      # transparency
                    color='green',
                    s=norm_age *1000 # pass in weights 
        
                   )

ax.set_ylabel('CodeRevHrs')
ax.set_title('Work Week Hours and Code Revision Hours by Age of people')
ax.legend(['Age'], loc='upper right', fontsize='x-large')
Out[32]:
<matplotlib.legend.Legend at 0x24809efa550>
No description has been provided for this image

Visualizing composition of data¶

Pie Charts¶

Create a pie chart of the top 5 databases that respondents wish to learn next year. Label the pie chart with database names. Display percentages of each database on the pie chart.

In [33]:
# your code goes here

QUERY = """
SELECT count(DatabaseDesireNextYear) as Count,DatabaseDesireNextYear
FROM DatabaseDesireNextYear
Group BY DatabaseDesireNextYear
ORDER BY Count desc
LIMIT 5
"""
df = pd.read_sql_query(QUERY,conn)
explode_list = [ 0, 0, 0.1, 0.1, 0.2] # ratio for each database with which to offset each wedge.

df["Count"].plot(kind='pie',
                            figsize=(14,8),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,                 # turn off labels on pie chart
                            pctdistance=1.12,            # the ratio between the pie center and start of text label
                            explode=explode_list         # 'explode' lowest 3 databases
                            )
# scale the title up by 12% to match pctdistance
plt.title('Database Desire Next Year by StackOverflow Survey 2019', y=1.12) 

plt.axis('equal') 

# add legend
plt.legend(labels=df["DatabaseDesireNextYear"], loc='upper left')
Out[33]:
<matplotlib.legend.Legend at 0x2480a0cb0d0>
No description has been provided for this image
In [34]:
QUERY="""
SELECT DatabaseWorkedWith,count(Respondent)
FROM DatabaseWorkedWith
group by Respondent having count(DatabaseWorkedWith)=1 and DatabaseWorkedWith='MySQL'"""
df2=pd.read_sql_query(QUERY,conn)
df2.count()
Out[34]:
DatabaseWorkedWith    474
count(Respondent)     474
dtype: int64
In [35]:
QUERY = """
SELECT count(LanguageWorkedWith) as Count,LanguageWorkedWith
FROM LanguageWorkedWith
Group BY LanguageWorkedWith
ORDER BY Count desc
LIMIT 30
"""
df = pd.read_sql_query(QUERY,conn)
df.head(30)
Out[35]:
Count LanguageWorkedWith
0 8687 JavaScript
1 7830 HTML/CSS
2 7106 SQL
3 4642 Bash/Shell/PowerShell
4 4542 Python
5 4506 Java
6 4288 C#
7 3232 TypeScript
8 2913 PHP
9 1946 C++
10 1578 C
11 1149 Ruby
12 1114 Go
13 840 Other(s):
14 751 Kotlin
15 707 Swift
16 628 VBA
17 585 R
18 518 Objective-C
19 492 Scala
20 437 Assembly
21 324 Rust
22 237 Dart
23 187 Elixir
24 164 Clojure
25 158 F#
26 133 WebAssembly
27 98 Erlang
In [36]:
# your code goes here

QUERY = """
SELECT count(LanguageDesireNextYear) as Count,LanguageDesireNextYear
FROM LanguageDesireNextYear
Group BY LanguageDesireNextYear
ORDER BY Count desc
LIMIT 5
"""
df = pd.read_sql_query(QUERY,conn)
explode_list = [ 0, 0, 0.1, 0.1, 0.2] # ratio for each database with which to offset each wedge.

df["Count"].plot(kind='pie',
                            figsize=(14, 8),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,                 # turn off labels on pie chart
                            pctdistance=1.12,            # the ratio between the pie center and start of text label
                            explode=explode_list         # 'explode' lowest 3 databases
                            )
# scale the title up by 12% to match pctdistance
plt.title('Language Desire Next Year by StackOverflow Survey 2019', y=1.12) 

plt.axis('equal') 

# add legend
plt.legend(labels=df["LanguageDesireNextYear"], loc='upper left') 
Out[36]:
<matplotlib.legend.Legend at 0x24809fcdaf0>
No description has been provided for this image

Stacked Charts¶

Create a stacked chart of median WorkWeekHrs and CodeRevHrs for the age group 30 to 35.

In [37]:
# your code goes here
QUERY = """
SELECT WorkWeekHrs,CodeRevHrs,Age
FROM Master
Where Age between 30 and 35

"""
df = pd.read_sql_query(QUERY,conn)
df = df.groupby("Age").median()
df.plot(kind='bar', stacked=True, figsize=(10, 6), rot=90) # rotate the xticks(labelled points on x-axis) by 90 degrees

plt.xlabel('Age')
plt.ylabel('Hours')
plt.title('Median Work Week Hours and Code Revision hours in 30-35 age group')
Out[37]:
Text(0.5, 1.0, 'Median Work Week Hours and Code Revision hours in 30-35 age group')
No description has been provided for this image

Visualizing comparison of data¶

Line Chart¶

Plot the median ConvertedComp for all ages from 45 to 60.

In [38]:
# your code goes here
QUERY = """
SELECT ConvertedComp,Age
FROM Master
Where Age between 45 and 60

"""
df = pd.read_sql_query(QUERY,conn)
df = df.groupby("Age").median()
df.plot(kind='line',figsize=(10, 6))

plt.title('Median ConvertedComp in 45-60 age group')
plt.ylabel('ConvertedComp')
plt.xlabel('Age')

plt.show() # need this line to show the updates made to the figure
No description has been provided for this image

Bar Chart¶

Create a horizontal bar chart using column MainBranch.

In [44]:
# your code goes here
QUERY = """
SELECT MainBranch,count(MainBranch) as count
FROM Master
Group by MainBranch
"""
df = pd.read_sql_query(QUERY,conn)

df = df.set_index("MainBranch")
df["count"]=df["count"].astype('int')

df.rename(columns={'count':'Total People Opinion'}, inplace=True)

fig, ax = plt.subplots(figsize=(12,6))
# df.rename(columns={'count':'Total People Opinion'}, inplace=True)

df.plot(kind='barh', legend = False, ax=ax, color='aquamarine')
ax.set_xlabel('Number of People',color='black',fontfamily="Montserrat",fontsize=14)
ax.set_ylabel('Main Branch',color='black',fontfamily="Montserrat",fontsize=14)
ax.set_xlabel('Number of People')
ax.set_title('People Opinion about MainBranch',color='black',fontfamily="Montserrat",fontsize=20,loc='left')

ax.set_facecolor('whitesmoke')

ax.tick_params(axis='x', colors='black')    #setting up X-axis tick color to black
ax.tick_params(axis='y', colors='black')    #setting up Y-axis tick color to black

plt.rcParams["font.family"] = "Montserrat"
plt.grid(visible=None)

# annotate value labels to each country
for i in range(len(df)): #enamurate returns tuple
    # print(df["Total People Opinion"][i])
    value = df["Total People Opinion"][i]
    label = format(int(value), ',') # format int with commas

# place text at the end of bar (subtracting 700 from x, and 0.1 from y to make it fit within the bar)
    plt.annotate(label, xy=(value-700, i-0.1), color='black')
plt.savefig("mainbranch.png",bbox_inches='tight')
plt.show()
No description has been provided for this image

Close the database connection.

In [40]:
# conn.close()