Interactive data analysis on a local dataset
In the example below you can see a simple data analysis based on Facebook profile data. Facebook uses a secret algorithm to create a buddy order for the profiles. The order of friends depends on many factors, but definitely includes the event of the profile page viewed by friends.
For quick data manipulation and data visualization I use the Jupyter Notebook. It is very helpful for data calculation, manipulation, and documentation. In the last step, I will show you how to share that script with others on the internal network.
1. Loading data
In the first step, you must download the raw data. In this case, you will go to the Facebook profile and download the source code html file (login to FB, open the profile page, right click for source view and download). In case you want the analysis to cover a period of time, download it every day for multiple days. If you don’t see a change in the buddy order, it only means that the algorithm probably didn’t run, because the FB algorithm doesn’t run every day.
In python data analysis we often use the dataframe structure. It’s like a table, datatable or a matrix. It has rows and columns, where we apply different operations, calculations. For this we will use pandas library. To find the buddy_id-s in the html files, it requires running regular expressions, which is in re library. In addition, I use the datetime and the sqlite3 library, because I will store the cleaned data with the data stamps in sqlite3 database table. If you want it to run automatically every day, or periodically, the useful component is the selenium library.
In the first step import the libraries and initialize the database connection.
import pandas as pd
from datetime import datetime
import re
import sqlite3
import datetime as dt
from pathlib import Path
conn = sqlite3.connect(r'c:\\Python_dev\\fb_buddy.db')
Depending on the frequency of the run, you can choose between the fixed file read or an interactive method.
for txt_path in Path("c:\fb_python\").glob("fb_*_profile_sourcefile.html"): f = open(txt_path, 'r',encoding='utf8')
The ipywidgets library contains some components that can serve interactivity for your data loading.
The FileUpload widget’s attributes describe the uploadable file extension with the ’accept’ property.
import ipywidgets as widgetsf = widgets.FileUpload(
accept='.html',
multiple=False
)
display(f)
In the opened and read file, the strings which start with ’buddy_id’ and end with picture, are inserted into a list.
"buddy_id":"1000XXXXXXXXXXX","user":{"id":"10000XXXXXXXXXX","__isProfile":
"User","name":"XXXXX XXXXX","story_bucket":{"nodes":[{"id":"XXXXXXXXXXXXXXXX",
"first_story_to_show":null}]},"profile_picture":
The list elements contain the names sorted by the buddy_id-s.
In different languages it’s necessary to change the special letters stored and coded in html to readable letters. The script will replace these coded characters and special letters according to the dictionary that we created manually. The data will be stored in a dataframe structure.
input_file = list(f.value.values())[0]
content = input_file['content'].decode('utf-8')lista =re.findall('buddy_id(.+?)picture', content)
for i in range(len(lista)):
lista[i] = re.findall('name\"\:"(.+?)\"\,', lista[i])df = pd.DataFrame(lista, columns={'name'})dict_hu = {'u00c1': 'Á','u00e1': 'á','u00f3': 'ó','u00e9': 'é','u0151': 'ő','u00f6': 'ö','u00ed': 'í','u00c9': 'É','u00fa': 'ú','u00fc': 'ü'}for key in dict_hu.keys():
df['name'] = df['name'].str.replace(key, dict_hu[key])
df['name'] = df['name'].str.replace('\\', '')
We will expand the dataframe with order number and datestamp and then we will load it into the sqlite3 database table.
df['rn']=df.index+1
now = datetime.now()
df['date']=now.strftime('%Y.%m.%d')
df.columns =['name', 'rn','date']df.to_sql('fb_buddy', conn, if_exists='append', index=False)
2. Data analysis and visualisation
2.1. Time series data visualisation
In the first step we will read the data table from sqlite3 database in pandas dataframe. We will then sort the previously mentioned df (dataframe) by day and we will create a time dataframe that will contain all of the days what we are going to use in the analysis.
conn = sqlite3.connect(r'c:\\Python_dev\\fb_buddy.db')
df = pd.read_sql("select * from fb_buddy", con=conn)df_days = pd.read_sql("select distinct date from fb_buddy", con=conn)df.sort_values(by=['date'], inplace=True)
df_days.sort_values(by=['date'], inplace=True)
We’ll then use an interactive dropdown menu, that contains the buddy’s names. It is important to use unique items in the list!
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
We must create a new dataframe for joining all of the days with the daily buddy names. It’s a typical left join method. For join use the merge operation as shown below.
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)@interact
def scatter_plot(Name=list(sorted(df['name'].unique()))):
df1 = df[df['name']==Name]
result = pd.merge(df_days,
df1[['date', 'name', 'rn']],
on='date',
how='left')
df1=result
df1[["date","rn"]].iplot(kind='line', x=['date'], y=['rn'])
The line chart changes dynamically according to the selected name.
2.2. Heatmap visualization
The order of your selected buddy changes in time as shown in the previous chart, but for comparing the buddy’s order in time, we will now use a heatmap. For the chart we use two libraries: seaborn and matplotlib.
import seaborn as sns
import matplotlib.pyplot as plt
We are going to use an aggregation on the dataframe and count the buddy’s occurrences on the full dataset. Select the most frequently occurring names to be used in the heatmap. In the map the red represents the bigger value of occurence.
var1 = df.groupby(['name'])['rn'].agg('count').reset_index(name='counts').sort_values(by=['counts','name'], ascending=False)['counts'].unique()[1] #or [0]df = pd.read_sql("select distinct * from fb_buddy", con=conn)
df_x = df.groupby(['name']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
df1=df[df['name'].isin(df_x[(df_x['counts']==var1)]['name'].to_list())]
heatmap1_data = pd.pivot_table(df1, values='rn',
index=['name'],
columns='date')plt.figure(figsize = (15,15))
sns.heatmap(heatmap1_data, cmap="RdBu")
We can see the selected buddys correlation in another heatmap with correlation coefficience values.
df_x = df.groupby(['name']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
df_y = df_x[(df_x['counts']==var1)]list_y = df_y['name'].to_list()
list_of_values = list_ydf_y = df[df['name'].isin(list_of_values)]
table = pd.pivot_table(df_y, values="rn", index="date", columns="name")
plt.figure(figsize=(15,15))dataplot=sns.heatmap(table.corr(), annot=True)
plt.show()
In this chart the lighter color represents the biggest values.
So, in this case there doesn’t exist any correlation with any pair of item. Of course this is because all of the items (buddy) are independent from each other.
3. Use Voila for sharing the analysis
If you want to share this Jupyter Notebook calculation and charts, one way to do it is to use Voila. It turns your notebooks into a standalone application.
To install the Voila:
pip install voila
and run it in command line:
voila your_file.ipynb
There is a more comfortable mode to use it as well. After installation, the Voila button will appear in the menu of the Jupyter Notebook. Press it and ’voila’ will run the app in new browser tab.