Assignment 3: Question Answering¶
Welcome to the third assignment of Course 4. In this assignment you will explore question answering. You will implement the "Text-to-Text Transfer Transformer" (better known as T5). Since you implemented transformers from scratch last week, you will now be able to put them to use.
Overview¶
This assignment will be different from the two previous ones. Due to the memory constraints of this environment and for the sake of time, your model will be trained on small datasets, so you won't get models that you could use in production, but you will gain the necessary knowledge about how generative language models are trained and used. You also won't spend much time on the architecture of the models; instead, you will take a model that is pre-trained on a larger dataset and fine-tune it to get better results.
After completing this lab you will:
- Understand how the C4 dataset is structured.
- Pretrain a transformer model using a Masked Language Model.
- Understand how the "Text-to-Text Transfer Transformer" or T5 model works.
- Fine-tune the T5 model for question answering.
Before getting started take some time to read the following tips:
TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:¶
- All cells are frozen except for the ones where you need to submit your solutions.
- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.
- You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!
- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import traceback
import time
import json
from termcolor import colored
import string
import textwrap
import itertools
import numpy as np
import tensorflow_text as tf_text
import tensorflow as tf
import transformer_utils
import utils
# Will come in handy later
wrapper = textwrap.TextWrapper(width=70)
# Set random seed
np.random.seed(42)
import w3_unittest
1 - Prepare the data for pretraining T5¶
1.1 - Pre-Training Objective¶
In the initial phase of training a T5 model for a Question Answering task, the pre-training process involves leveraging a masked language model (MLM) on a very large dataset, such as the C4 dataset. The objective is to allow the model to learn contextualized representations of words and phrases, fostering a deeper understanding of language semantics. To initiate pre-training, it is essential to employ the Transformer architecture, which forms the backbone of T5. The Transformer's self-attention mechanism enables the model to weigh different parts of the input sequence dynamically, capturing long-range dependencies effectively.
Before delving into pre-training, thorough data preprocessing is crucial. The C4 dataset, a diverse and extensive collection of web pages, provides a rich source for language understanding tasks. The dataset needs to be tokenized into smaller units, such as subwords or words, to facilitate model input. Additionally, the text is often segmented into fixed-length sequences or batches, optimizing computational efficiency during training.
For the masked language modeling objective, a percentage of the tokenized input is randomly masked, and the model is trained to predict the original content of these masked tokens. This process encourages the T5 model to grasp contextual relationships between words and phrases, enhancing its ability to generate coherent and contextually appropriate responses during downstream tasks like question answering.
In summary, the pre-training of the T5 model involves utilizing the Transformer architecture on a sizable dataset like C4, coupled with meticulous data preprocessing to convert raw text into a format suitable for training. The incorporation of a masked language modeling objective ensures that the model learns robust contextual representations, laying a solid foundation for subsequent fine-tuning on specific tasks such as question answering.
Note: The word "mask" will be used throughout this assignment in the context of hiding/removing word(s).
You will be implementing the masked language model (MLM) as shown in the following image.
Assume you have the following text: Thank you for inviting me to your party last week
Now, as input, you will mask the words shown in red in the text:
Input: Thank you X me to your party Y week.
Output: The model should predict the word(s) for X and Y.
[EOS] will be used to mark the end of the target sequence.
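To make the idea concrete before working with a real tokenizer, here is a minimal, purely illustrative sketch (not the graded implementation) that masks whole words with placeholder sentinel symbols. Which words get masked depends on the random seed, so the output will not match the image exactly.
# Illustrative sketch only (not the graded implementation): word-level masking
# with placeholder sentinel symbols, mirroring the example above.
import random

def toy_mask(sentence, mask_prob=0.25, seed=0):
    rng = random.Random(seed)
    placeholder_sentinels = iter("XYZWVUTSRQ")  # toy sentinel symbols
    inputs, targets = [], []
    prev_masked = False
    for word in sentence.split():
        if rng.random() < mask_prob:       # mask this word
            if not prev_masked:            # open a new sentinel only for a new span
                sentinel = next(placeholder_sentinels)
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(word)
            prev_masked = True
        else:
            inputs.append(word)
            prev_masked = False
    targets.append("[EOS]")                # mark the end of the target sequence
    return " ".join(inputs), " ".join(targets)

toy_input, toy_target = toy_mask("Thank you for inviting me to your party last week")
print("Input: ", toy_input)
print("Output:", toy_target)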
1.2 - C4 Dataset¶
The C4 dataset (Colossal Clean Crawled Corpus) is a large-scale dataset of web pages built from data collected by the Common Crawl organization. It is commonly used for various natural language processing tasks and machine learning research. Each sample in the C4 dataset follows a consistent format, making it suitable for pretraining models like T5. Here's a short description of the C4 dataset:
Format: Each sample in the C4 dataset is represented as a JSON object, containing several key-value pairs.
Content: The 'text' field in each sample contains the actual text content extracted from web pages. This text often includes a wide range of topics and writing styles, making it diverse and suitable for training language models.
Metadata: The dataset includes metadata such as 'content-length,' 'content-type,' 'timestamp,' and 'url,' providing additional information about each web page. 'Content-length' specifies the length of the content, 'content-type' describes the type of content (e.g., 'text/plain'), 'timestamp' indicates when the web page was crawled, and 'url' provides the source URL of the web page.
Applications: The C4 dataset is commonly used for training and fine-tuning large-scale language models, such as T5. It serves as a valuable resource for tasks like text classification, named entity recognition, question answering, and more.
Size: The C4 dataset contains more than 800 GiB of text data, making it suitable for training models with billions of parameters.
Run the cell below to see what the C4 dataset looks like.
# Load example jsons
with open('data/c4-en-10k.jsonl', 'r') as file:
example_jsons = [json.loads(line.strip()) for line in file]
# Print the examples to see what the data looks like
for i in range(5):
print(f'example number {i+1}: \n\n{example_jsons[i]} \n')
example number 1:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'}
example number 2:
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb internal drive and a 240gb SSD.\nWhen trying to restore using disk utility i\'m given the error "Not enough space on disk ____ to restore"\nBut I shouldn\'t have to do that!!!\nAny ideas or workarounds before resorting to the above?\nUse Carbon Copy Cloner to copy one drive to the other. I\'ve done this several times going from larger HDD to smaller SSD and I wound up with a bootable SSD drive. One step you have to remember not to skip is to use Disk Utility to partition the SSD as GUID partition scheme HFS+ before doing the clone. If it came Apple Partition Scheme, even if you let CCC do the clone, the resulting drive won\'t be bootable. CCC usually works in "file mode" and it can easily copy a larger drive (that\'s mostly empty) onto a smaller drive. If you tell CCC to clone a drive you did NOT boot from, it can work in block copy mode where the destination drive must be the same size or larger than the drive you are cloning from (if I recall).\nI\'ve actually done this somehow on Disk Utility several times (booting from a different drive (or even the dvd) so not running disk utility from the drive your cloning) and had it work just fine from larger to smaller bootable clone. Definitely format the drive cloning to first, as bootable Apple etc..\nThanks for pointing this out. My only experience using DU to go larger to smaller was when I was trying to make a Lion install stick and I was unable to restore InstallESD.dmg to a 4 GB USB stick but of course the reason that wouldn\'t fit is there was slightly more than 4 GB of data.'}
example number 3:
{'text': 'Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metallic elastic belt with O-ring. Headband included. Great hip hop or jazz dance costume. Made in the USA.'}
example number 4:
{'text': "How many backlinks per day for new site?\nDiscussion in 'Black Hat SEO' started by Omoplata, Dec 3, 2010.\n1) for a newly created site, what's the max # backlinks per day I should do to be safe?\n2) how long do I have to let my site age before I can start making more blinks?\nI did about 6000 forum profiles every 24 hours for 10 days for one of my sites which had a brand new domain.\nThere is three backlinks for every of these forum profile so thats 18 000 backlinks every 24 hours and nothing happened in terms of being penalized or sandboxed. This is now maybe 3 months ago and the site is ranking on first page for a lot of my targeted keywords.\nbuild more you can in starting but do manual submission and not spammy type means manual + relevant to the post.. then after 1 month you can make a big blast..\nWow, dude, you built 18k backlinks a day on a brand new site? How quickly did you rank up? What kind of competition/searches did those keywords have?"}
example number 5:
{'text': 'The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.\nWe are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.\nDenver voters on Tuesday approved bond and mill funding measures for students in Denver Public Schools, agreeing to invest $572 million in bond funding to build and improve schools and $56.6 million in operating dollars to support proven initiatives, such as early literacy.\nDenver voters say yes to bond and mill levy funding support for DPS students and schools. Click to learn more about the details of the voter-approved bond measure.\nDenver voters on Nov. 8 approved bond and mill funding measures for DPS students and schools. Learn more about what’s included in the mill levy measure.'}
1.3 - Process C4¶
For the purpose of pretraining the T5 model, you will only use the content of each entry. In the following code, you extract only the text field from all the entries in the dataset. This is the data that you will use to create the inputs and targets of your language model.
# Grab text field from dictionary
natural_language_texts = [example_json['text'] for example_json in example_jsons]
# Print the first text example
print(natural_language_texts[0])
Beginners BBQ Class Taking Place in Missoula! Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills. He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information. The cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.
1.4 - Decode to Natural Language¶
The SentencePieceTokenizer, used in the code snippet, tokenizes text into subword units, enhancing handling of complex word structures, out-of-vocabulary words, and multilingual support. It simplifies preprocessing, ensures consistent tokenization, and seamlessly integrates with machine learning frameworks.
In this task, a SentencePiece model is loaded from a file, which is used to tokenize text into subwords represented by integer IDs.
# Special tokens
# PAD, EOS = 0, 1
with open("./models/sentencepiece.model", "rb") as f:
pre_trained_tokenizer = f.read()
tokenizer = tf_text.SentencepieceTokenizer(pre_trained_tokenizer, out_type=tf.int32)
In this tokenizer, the string </s> is used as the EOS token. By default, the tokenizer does not add the EOS token to the end of each sentence, so you need to add it manually when required. Let's verify which id corresponds to this token:
eos = tokenizer.string_to_id("</s>").numpy()
print("EOS: " + str(eos))
EOS: 1
This code shows how individual words from a given text are tokenized, in this case the third entry of the dataset.
# printing the encoding of each word to see how subwords are tokenized
tokenized_text = [(list(tokenizer.tokenize(word).numpy()), word) for word in natural_language_texts[2].split()]
print("Word\t\t-->\tTokenization")
print("-"*40)
for element in tokenized_text:
print(f"{element[1]:<8}\t-->\t{element[0]}")
Word --> Tokenization ---------------------------------------- Foil --> [4452, 173] plaid --> [30772] lycra --> [3, 120, 2935] and --> [11] spandex --> [8438, 26, 994] shortall --> [710, 1748] with --> [28] metallic --> [18813] slinky --> [3, 7, 4907, 63] insets. --> [16, 2244, 7, 5] Attached --> [28416, 15, 26] metallic --> [18813] elastic --> [15855] belt --> [6782] with --> [28] O-ring. --> [411, 18, 1007, 5] Headband --> [3642, 3348] included. --> [1285, 5] Great --> [1651] hip --> [5436] hop --> [13652] or --> [42] jazz --> [9948] dance --> [2595] costume. --> [11594, 5] Made --> [6465] in --> [16] the --> [8] USA. --> [2312, 5]
As usual, the library also provides a function to turn numeric tokens back into human-readable text. Look at how it works.
# We can see that detokenize successfully undoes the tokenization
print(f"tokenized: {tokenizer.tokenize('Beginners')}\ndetokenized: {tokenizer.detokenize(tokenizer.tokenize('Beginners'))}")
tokenized: [12847 277] detokenized: b'Beginners'
As you can see above, you were able to take a piece of string and tokenize it.
Now you will create input and target pairs that will allow you to train your model. T5 uses the ids at the end of the vocab file as sentinels. For example, it will replace:
- vocab_size - 1 by <Z>
- vocab_size - 2 by <Y>
- and so forth.
It assigns each sentinel a character.
The pretty_decode function below, which you will use in a bit, helps in handling the type when decoding. Take a look and try to understand what the function is doing.
Notice that:
string.ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NOTE: Targets may need more than the 52 sentinels we map here, but this is just to give you an idea of how it works.
def get_sentinels(tokenizer, display=False):
sentinels = {}
vocab_size = tokenizer.vocab_size(name=None)
for i, char in enumerate(reversed(string.ascii_letters), 1):
decoded_text = tokenizer.detokenize([vocab_size - i]).numpy().decode("utf-8")
# Sentinels, ex: <Z> - <a>
sentinels[decoded_text] = f'<{char}>'
if display:
print(f'The sentinel is <{char}> and the decoded token is:', decoded_text)
return sentinels
def pretty_decode(encoded_str_list, sentinels, tokenizer):
# If already a string, just do the replacements.
if tf.is_tensor(encoded_str_list) and encoded_str_list.dtype == tf.string:
for token, char in sentinels.items():
encoded_str_list = tf.strings.regex_replace(encoded_str_list, token, char)
return encoded_str_list
# We need to decode and then prettyfy it.
return pretty_decode(tokenizer.detokenize(encoded_str_list), sentinels, tokenizer)
sentinels = get_sentinels(tokenizer, display=True)
The sentinel is <Z> and the decoded token is: Internațional The sentinel is <Y> and the decoded token is: erwachsene The sentinel is <X> and the decoded token is: Cushion The sentinel is <W> and the decoded token is: imunitar The sentinel is <V> and the decoded token is: Intellectual The sentinel is <U> and the decoded token is: traditi The sentinel is <T> and the decoded token is: disguise The sentinel is <S> and the decoded token is: exerce The sentinel is <R> and the decoded token is: nourishe The sentinel is <Q> and the decoded token is: predominant The sentinel is <P> and the decoded token is: amitié The sentinel is <O> and the decoded token is: erkennt The sentinel is <N> and the decoded token is: dimension The sentinel is <M> and the decoded token is: inférieur The sentinel is <L> and the decoded token is: refugi The sentinel is <K> and the decoded token is: cheddar The sentinel is <J> and the decoded token is: unterlieg The sentinel is <I> and the decoded token is: garanteaz The sentinel is <H> and the decoded token is: făcute The sentinel is <G> and the decoded token is: réglage The sentinel is <F> and the decoded token is: pedepse The sentinel is <E> and the decoded token is: Germain The sentinel is <D> and the decoded token is: distinctly The sentinel is <C> and the decoded token is: Schraub The sentinel is <B> and the decoded token is: emanat The sentinel is <A> and the decoded token is: trimestre The sentinel is <z> and the decoded token is: disrespect The sentinel is <y> and the decoded token is: Erasmus The sentinel is <x> and the decoded token is: Australia The sentinel is <w> and the decoded token is: permeabil The sentinel is <v> and the decoded token is: deseori The sentinel is <u> and the decoded token is: manipulated The sentinel is <t> and the decoded token is: suggér The sentinel is <s> and the decoded token is: corespund The sentinel is <r> and the decoded token is: nitro The sentinel is <q> and the decoded token is: oyons The sentinel is <p> and the decoded token is: Account The sentinel is <o> and the decoded token is: échéan The sentinel is <n> and the decoded token is: laundering The sentinel is <m> and the decoded token is: genealogy The sentinel is <l> and the decoded token is: QuickBooks The sentinel is <k> and the decoded token is: constituted The sentinel is <j> and the decoded token is: Fertigung The sentinel is <i> and the decoded token is: goutte The sentinel is <h> and the decoded token is: regulă The sentinel is <g> and the decoded token is: overwhelmingly The sentinel is <f> and the decoded token is: émerg The sentinel is <e> and the decoded token is: broyeur The sentinel is <d> and the decoded token is: povești The sentinel is <c> and the decoded token is: emulator The sentinel is <b> and the decoded token is: halloween The sentinel is <a> and the decoded token is: combustibil
Now, let's use the pretty_decode function on the following sentence. Note that any word that matches one of the sentinel tokens will be replaced by the function with the corresponding sentinel marker. This could be a drawback of the method, but don't worry about it for now.
pretty_decode(tf.constant("I want to dress up as an Intellectual this halloween."), sentinels, tokenizer)
<tf.Tensor: shape=(), dtype=string, numpy=b'I want to dress up as an <V> this <b>.'>
The functions above make your inputs and targets more readable. For example, you might see something like this once you implement the masking function below.
- Input sentence: Younes and Lukasz were working together in the lab yesterday after lunch.
- Input: Younes and Lukasz Z together in the Y yesterday after lunch.
- Target: Z were working Y lab.
1.5 - Tokenizing and Masking¶
In this task, you will implement the tokenize_and_mask function, which tokenizes and masks input words based on a given probability. The probability is controlled by the noise parameter, typically set to mask around 15% of the words in the input text. The function will generate two lists of tokenized sequences following the algorithm outlined below:
Exercise 1 - tokenize_and_mask¶
- Start with two empty lists: inps and targs.
- Tokenize the input text using the given tokenizer.
- For each token in the tokenized sequence:
  - Generate a random number (simulating a weighted coin toss).
  - If the random value is greater than the given threshold (noise):
    - Add the current token to the inps list.
  - Else:
    - If a new sentinel must be included (read note **):
      - Compute the next sentinel ID using a progression.
      - Add the sentinel to both inps and targs to mark the position of the masked element.
    - Add the current token to the targs list.
** There's a special case to consider. If two or more consecutive tokens get masked during the process, you don't need to add a new sentinel to the sequences. To account for this, use the prev_no_mask flag, which starts as True but is turned to False each time you mask a new element. The code that adds sentinels will only be executed if, before masking the token, the flag was in the True state.
# GRADED FUNCTION: tokenize_and_mask
def tokenize_and_mask(text,
noise=0.15,
randomizer=np.random.uniform,
tokenizer=None):
"""Tokenizes and masks a given input.
Args:
text (str or bytes): Text input.
noise (float, optional): Probability of masking a token. Defaults to 0.15.
randomizer (function, optional): Function that generates random values. Defaults to np.random.uniform.
tokenizer (function, optional): Tokenizer function. Defaults to tokenize.
Returns:
inps, targs: Lists of integers associated to inputs and targets.
"""
# Current sentinel number (starts at 0)
cur_sentinel_num = 0
# Inputs and targets
inps, targs = [], []
# Vocab_size
vocab_size = int(tokenizer.vocab_size())
# EOS token id
# Must be at the end of each target!
eos = tokenizer.string_to_id("</s>").numpy()
### START CODE HERE ###
# prev_no_mask is True if the previous token was NOT masked, False otherwise
# set prev_no_mask to True
prev_no_mask = True
# Loop over the tokenized text
for token in tokenizer.tokenize(text).numpy():
# Generate a random value between 0 and 1
rnd_val = randomizer()
# Check if the noise is greater than a random value (weighted coin flip)
if noise > rnd_val:
# Check if previous token was NOT masked
if prev_no_mask:
# Current sentinel increases by 1
cur_sentinel_num += 1
# Compute end_id by subtracting current sentinel value out of the total vocabulary size
end_id = vocab_size - cur_sentinel_num
# Append end_id at the end of the targets
targs.append(end_id)
# Append end_id at the end of the inputs
inps.append(end_id)
# Append token at the end of the targets
targs.append(token)
# set prev_no_mask accordingly
prev_no_mask = False
else:
# Append token at the end of the inputs
inps.append(token)
# Set prev_no_mask accordingly
prev_no_mask = True
# Add EOS token to the end of the targets
targs.append(eos)
### END CODE HERE ###
return inps, targs
# Some logic to mock a np.random value generator
# Needs to be in the same cell for it to always generate same output
def testing_rnd():
def dummy_generator():
vals = np.linspace(0, 1, 10)
cyclic_vals = itertools.cycle(vals)
for _ in range(100):
yield next(cyclic_vals)
dumr = itertools.cycle(dummy_generator())
def dummy_randomizer():
return next(dumr)
return dummy_randomizer
input_str = 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers.'
inps, targs = tokenize_and_mask(input_str, randomizer=testing_rnd(), tokenizer=tokenizer)
print(f"tokenized inputs - shape={len(inps)}:\n\n{inps}\n\ntargets - shape={len(targs)}:\n\n{targs}")
tokenized inputs - shape=53: [31999, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 31998, 531, 25, 241, 12, 129, 394, 44, 492, 31997, 58, 148, 56, 43, 8, 1004, 6, 474, 31996, 39, 4793, 230, 5, 2721, 6, 1600, 1630, 31995, 1150, 4501, 15068, 16127, 6, 9137, 2659, 5595, 31994, 782, 3624, 14627, 15, 12612, 277, 5] targets - shape=19: [31999, 12847, 277, 31998, 9, 55, 31997, 3326, 15068, 31996, 48, 30, 31995, 727, 1715, 31994, 45, 301, 1]
Expected Output:¶
tokenized inputs - shape=53:
[31999 15068 4501 3 12297 3399 16 5964 7115 31998 531 25
241 12 129 394 44 492 31997 58 148 56 43 8
1004 6 474 31996 39 4793 230 5 2721 6 1600 1630
31995 1150 4501 15068 16127 6 9137 2659 5595 31994 782 3624
14627 15 12612 277 5]
targets - shape=19:
[31999 12847 277 31998 9 55 31997 3326 15068 31996 48 30
31995 727 1715 31994 45 301 1]
# Test your implementation!
w3_unittest.test_tokenize_and_mask(tokenize_and_mask)
All tests passed
You will now use the inputs and the targets from the tokenize_and_mask function you implemented above. Take a look at the decoded version of your masked sentence using your inps and targs from the sentence above.
print('Inputs: \n\n', pretty_decode(inps, sentinels, tokenizer).numpy())
print('\nTargets: \n\n', pretty_decode(targs, sentinels, tokenizer).numpy())
Inputs: b'<Z> BBQ Class Taking Place in Missoul <Y> Do you want to get better at making <X>? You will have the opportunity, put <W> your calendar now. Thursday, September 22 <V> World Class BBQ Champion, Tony Balay <U>onestar Smoke Rangers.' Targets: b'<Z> Beginners <Y>a! <X> delicious BBQ <W> this on <V>nd join <U> from L'
1.6 - Creating the Pairs¶
You will now create pairs using your dataset. You will iterate over your data and create (inp, targ) pairs using the functions that we have given you.
# Apply tokenize_and_mask
inputs_targets_pairs = [tokenize_and_mask(text.encode('utf-8', errors='ignore').decode('utf-8'), tokenizer=tokenizer)
for text in natural_language_texts[0:2000]]
def display_input_target_pairs(inputs_targets_pairs, sentinels, wrapper=textwrap.TextWrapper(width=70), tokenizer=tokenizer):
for i, inp_tgt_pair in enumerate(inputs_targets_pairs, 1):
inps, tgts = inp_tgt_pair
inps = str(pretty_decode(inps, sentinels, tokenizer).numpy(), encoding='utf-8')
tgts = str(pretty_decode(tgts, sentinels, tokenizer).numpy(), encoding='utf-8')
print(f'[{i}]\n\n'
f'inputs:\n{wrapper.fill(text=inps)}\n\n'
f'targets:\n{wrapper.fill(text=tgts)}\n\n\n')
# Print 3 samples. We print inputs with fewer than 100 tokens, just to give you an idea of the process
display_input_target_pairs(filter(lambda x: len(x[0]) < 100, inputs_targets_pairs[0:12]), sentinels, wrapper, tokenizer)
[1] inputs: <Z>il plaid <Y>lycra <X> spandex shortall with metallic slinky <W>sets. Attache <V> metallic elastic belt with O <U>ring. Head <T> included. Great hip hop<S> jazz dance costume.<R> in the USA. targets: <Z> Fo <Y> <X> and <W> in <V>d <U>- <T>band<S> or<R> Made [2] inputs: I thought I was going to <Z> 3rd season <Y> Wire tonight. <X> there was a commentary <W> 11, so I had to re <V>watch <U> Ground with <T> commentary. Hopefully<S> can finish<R> season <Q>. targets: <Z> finish the <Y> of the <X> But <W> on episode <V>- <U> Middle <T> the<S> I<R> the <Q> next weekend [3] inputs: Pencarian <Z>FILM Untuk " <Y>eace <X>er 2017 <W> yuk mampir ke channel say <V>. Edges <U> provides the l.. A corrupt cop makes one w.. <T>er 2017 ⁇ <S> ⁇ .. Náo Lo ⁇ n - Peace Break.. Please subscribe and hit ..<R> in HD at http://.. <Q> cannot believe I manage.. targets: <Z> <Y>P <X> Break <W>" <V>. <U> East <T> Peace Break<S> <R> uploaded <Q> I
2 - Pretrain a T5 model using C4¶
Now you are going to use the Transformer architecture that you coded in the previous assignment to summarize text, but this time you will use it to answer questions. Instead of training the question answering model from scratch, you will first "pre-train" the model using the C4 dataset you just processed. This will help the model learn the general structure of language from a large dataset. This is much easier to do because you don't need to label any data; you just use the masking, which is done automatically. You will then use the data from the SQuAD dataset to teach the model to answer questions given a context. To start, let's review the Transformer architecture.
2.1 - Instantiate a new transformer model¶
We have packaged the code implemented in the previous week into the transformer_utils module. You can import it here and set it up with the same configuration used there.
# Define the model parameters
num_layers = 2
embedding_dim = 128
fully_connected_dim = 128
num_heads = 2
positional_encoding_length = 256
encoder_vocab_size = int(tokenizer.vocab_size())
decoder_vocab_size = encoder_vocab_size
# Initialize the model
transformer = transformer_utils.Transformer(
num_layers,
embedding_dim,
num_heads,
fully_connected_dim,
encoder_vocab_size,
decoder_vocab_size,
positional_encoding_length,
positional_encoding_length,
)
Now, you will define the optimizer and the loss function. For this task the model will try to predict the masked words, so, as in the previous lab, the loss function will be SparseCategoricalCrossentropy.
learning_rate = transformer_utils.CustomSchedule(embedding_dim)
optimizer = tf.keras.optimizers.Adam(0.0001, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
train_loss = tf.keras.metrics.Mean(name='train_loss')
# Here you will store the losses, so you can later plot them
losses = []
2.2 - C4 pretraining¶
For training a TensorFlow model you need to arrange the data into datasets. Now, you will get the inputs and the targets for the transformer model from inputs_targets_pairs. Before creating the dataset, you need to make sure that all inputs have the same length by truncating the longer sequences and padding the shorter ones with 0. The same must be done for the targets. The function tf.keras.preprocessing.sequence.pad_sequences will help you here, as in the previous week's assignment.
You will use a BATCH_SIZE = 64.
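If you have not used pad_sequences before, the tiny standalone example below (with made-up token ids) shows how padding='post' and truncating='post' behave:
# Small demo of pad_sequences: shorter sequences are zero-padded at the end,
# longer ones are truncated at the end (padding='post', truncating='post').
toy_sequences = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    toy_sequences, maxlen=5, padding='post', truncating='post')
print(padded)
# [[5 6 7 0 0]
#  [1 2 3 4 5]]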
# Limit the size of the input and output data so this can run in this environment
encoder_maxlen = 150
decoder_maxlen = 50
inputs = tf.keras.preprocessing.sequence.pad_sequences([x[0] for x in inputs_targets_pairs], maxlen=encoder_maxlen, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences([x[1] for x in inputs_targets_pairs], maxlen=decoder_maxlen, padding='post', truncating='post')
inputs = tf.cast(inputs, dtype=tf.int32)
targets = tf.cast(targets, dtype=tf.int32)
# Create the final training dataset.
BUFFER_SIZE = 10000
BATCH_SIZE = 64
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
Now, you can run the training loop for 10 epochs. Running it with a big dataset such as C4 on a good computer with enough memory and a good GPU could take more than 24 hours. Here, you will run a few epochs using a small portion of the C4 dataset for illustration. It will only take a few minutes, but the model won't be very powerful.
# Define the number of epochs
epochs = 10
# Training loop
for epoch in range(epochs):
start = time.time()
train_loss.reset_states()
number_of_batches=len(list(enumerate(dataset)))
for (batch, (inp, tar)) in enumerate(dataset):
print(f'Epoch {epoch+1}, Batch {batch+1}/{number_of_batches}', end='\r')
transformer_utils.train_step(inp, tar, transformer, loss_object, optimizer, train_loss)
print (f'Epoch {epoch+1}, Loss {train_loss.result():.4f}')
losses.append(train_loss.result())
print (f'Time taken for one epoch: {time.time() - start} sec')
# Save the pretrained model
# transformer.save_weights('./model_c4_temp')
Epoch 1, Loss 10.1184 Time taken for one epoch: 27.30163812637329 sec Epoch 2, Loss 9.5150 Time taken for one epoch: 7.5052008628845215 sec Epoch 3, Loss 8.9270 Time taken for one epoch: 6.7959606647491455 sec Epoch 4, Loss 8.3857 Time taken for one epoch: 6.633919715881348 sec Epoch 5, Loss 7.8903 Time taken for one epoch: 5.052973985671997 sec Epoch 6, Loss 7.4477 Time taken for one epoch: 4.966921329498291 sec Epoch 7, Loss 7.0561 Time taken for one epoch: 4.452875137329102 sec Epoch 8, Loss 6.7168 Time taken for one epoch: 5.342992782592773 sec Epoch 9, Loss 6.4472 Time taken for one epoch: 4.89825439453125 sec Epoch 10, Loss 6.2444 Time taken for one epoch: 4.38511323928833 sec
Load a pretrained model
To show how powerful this model actually is, we trained it for several epochs with the full dataset in Colab and saved the weights for you. You can load them using the cell below. For the rest of the notebook, you will see the power of transfer learning in action.
transformer.load_weights('./pretrained_models/model_c4')
<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7fa43ce98eb0>
3 - Fine-tune the T5 model for Question Answering¶
Now, you are going to fine-tune the pretrained model for question answering using the SQuAD 2.0 dataset.
SQuAD, short for Stanford Question Answering Dataset, is a dataset designed for training and evaluating question answering systems. It consists of real questions posed by humans on a set of Wikipedia articles, where the answer to each question is a specific span of text within the corresponding article.
SQuAD 1.1, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on about 500 articles. SQuAD 2.0 adds over 50,000 questions that cannot be answered from the given context. This extra set of questions can help train models to detect unanswerable questions.
Let's load the dataset.
with open('data/train-v2.0.json', 'r') as f:
example_jsons = json.load(f)
example_jsons = example_jsons['data']
print('Number of articles: ' + str(len(example_jsons)))
Number of articles: 442
The structure of each article is as follows:
- title: The article title
- paragraphs: A list of paragraphs and questions related to them
  - context: The actual paragraph text
  - qas: A set of questions related to the paragraph
    - question: A question
    - id: The question's unique identifier
    - is_impossible: Boolean, specifies whether the question can be answered
    - answers: A set of possible answers for the question
      - text: The answer
      - answer_start: The index of the character that starts the sentence containing the explicit answer to the question
Take a look at an article by running the next cell. Notice that the context is usually the last element for every paragraph:
example_article = example_jsons[0]
example_article
print("Title: " + example_article["title"])
print(example_article["paragraphs"][0])
Title: Beyoncé
{'qas': [{'question': 'When did Beyonce start becoming popular?', 'id': '56be85543aeaaa14008c9063', 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}], 'is_impossible': False}, {'question': 'What areas did Beyonce compete in when she was growing up?', 'id': '56be85543aeaaa14008c9065', 'answers': [{'text': 'singing and dancing', 'answer_start': 207}], 'is_impossible': False}, {'question': "When did Beyonce leave Destiny's Child and become a solo singer?", 'id': '56be85543aeaaa14008c9066', 'answers': [{'text': '2003', 'answer_start': 526}], 'is_impossible': False}, {'question': 'In what city and state did Beyonce grow up? ', 'id': '56bf6b0f3aeaaa14008c9601', 'answers': [{'text': 'Houston, Texas', 'answer_start': 166}], 'is_impossible': False}, {'question': 'In which decade did Beyonce become famous?', 'id': '56bf6b0f3aeaaa14008c9602', 'answers': [{'text': 'late 1990s', 'answer_start': 276}], 'is_impossible': False}, {'question': 'In what R&B group was she the lead singer?', 'id': '56bf6b0f3aeaaa14008c9603', 'answers': [{'text': "Destiny's Child", 'answer_start': 320}], 'is_impossible': False}, {'question': 'What album made her a worldwide known artist?', 'id': '56bf6b0f3aeaaa14008c9604', 'answers': [{'text': 'Dangerously in Love', 'answer_start': 505}], 'is_impossible': False}, {'question': "Who managed the Destiny's Child group?", 'id': '56bf6b0f3aeaaa14008c9605', 'answers': [{'text': 'Mathew Knowles', 'answer_start': 360}], 'is_impossible': False}, {'question': 'When did Beyoncé rise to fame?', 'id': '56d43c5f2ccc5a1400d830a9', 'answers': [{'text': 'late 1990s', 'answer_start': 276}], 'is_impossible': False}, {'question': "What role did Beyoncé have in Destiny's Child?", 'id': '56d43c5f2ccc5a1400d830aa', 'answers': [{'text': 'lead singer', 'answer_start': 290}], 'is_impossible': False}, {'question': 'What was the first album Beyoncé released as a solo artist?', 'id': '56d43c5f2ccc5a1400d830ab', 'answers': [{'text': 'Dangerously in Love', 'answer_start': 505}], 'is_impossible': False}, {'question': 'When did Beyoncé release Dangerously in Love?', 'id': '56d43c5f2ccc5a1400d830ac', 'answers': [{'text': '2003', 'answer_start': 526}], 'is_impossible': False}, {'question': 'How many Grammy awards did Beyoncé win for her first solo album?', 'id': '56d43c5f2ccc5a1400d830ad', 'answers': [{'text': 'five', 'answer_start': 590}], 'is_impossible': False}, {'question': "What was Beyoncé's role in Destiny's Child?", 'id': '56d43ce42ccc5a1400d830b4', 'answers': [{'text': 'lead singer', 'answer_start': 290}], 'is_impossible': False}, {'question': "What was the name of Beyoncé's first solo album?", 'id': '56d43ce42ccc5a1400d830b5', 'answers': [{'text': 'Dangerously in Love', 'answer_start': 505}], 'is_impossible': False}], 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'}
The previous article might be difficult to navigate so here is a nicely formatted example paragraph:
{
"context": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles 'Crazy in Love' and 'Baby Boy'",
"qas": [
{
"question": "When did Beyonce start becoming popular?",
"id": "56be85543aeaaa14008c9063",
"answers": [
{
"text": "in the late 1990s",
"answer_start": 269
}
],
"is_impossible": false
},
{
"question": "What areas did Beyonce compete in when she was growing up?",
"id": "56be85543aeaaa14008c9065",
"answers": [
{
"text": "singing and dancing",
"answer_start": 207
}
],
"is_impossible": false
}
]
}
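As a quick sanity check (a sketch, not part of the assignment), you can walk the structure described above and count how many of the loaded questions are answerable versus flagged as impossible; the exact numbers depend on the data file, so treat them as indicative only.
# Count answerable vs. unanswerable questions in the loaded SQuAD data
total_questions, impossible_questions = 0, 0
for article in example_jsons:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            total_questions += 1
            if qa['is_impossible']:
                impossible_questions += 1
print(f"Total questions: {total_questions}")
print(f"Unanswerable questions: {impossible_questions}")
print(f"Answerable questions: {total_questions - impossible_questions}")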
3.1 - Creating a list of paired question and answers¶
You are tasked with generating input/output pairs for a Question Answering (QA) model using the SQuAD 2.0 dataset. Each pair follows the structure:
- inputs: question: <Q> context: <P>
- targets: answer: <A>
Here, <Q> represents the question in the context of the given paragraph <P>, and <A> is a possible answer.
In this notebook, we will focus on a single answer per question. However, it's essential to note that the dataset contains questions with multiple answers. When training a model in real-life scenarios, consider including all available information.
Exercise 2 - Parse the SQuaD 2.0 Dataset¶
Your task is to implement the parse_squad function, which iterates over all the articles, paragraphs, and questions in the SQuAD dataset. Extract pairs of inputs and targets for the QA model using the provided code template.
- Start with two empty lists: inputs and targets.
- Loop over all the articles in the dataset.
- For each article, loop over each paragraph.
  - Extract the context from the paragraph.
  - Loop over each question in the given paragraph.
    - Check that the question is not impossible and has at least one answer.
    - If the above condition is met, create the question_context sequence as described in the input structure.
    - Create the answer sequence using the first answer from the available answers.
    - Append the question_context to the inputs list.
    - Append the answer to the targets list.
# GRADED FUNCTION: parse_squad
def parse_squad(dataset):
"""Extract all the answers/questions pairs from the SQuAD dataset
Args:
dataset (dict): The imported JSON dataset
Returns:
inputs, targets: Two lists containing the inputs and the targets for the QA model
"""
inputs, targets = [], []
### START CODE HERE ###
# Loop over all the articles
for article in dataset:
# Loop over each paragraph of each article
for paragraph in article['paragraphs']:
# Extract context from the paragraph
context = paragraph['context']
# Loop over each question of the given paragraph
for qa in paragraph['qas']:
# If this question is not impossible and there is at least one answer
if len(qa['answers']) > 0 and not qa['is_impossible']:
# Create the question/context sequence
question_context = 'question: ' + qa["question"] + ' context: ' + context
# Create the answer sequence. Use the text field of the first answer
answer = 'answer: ' + qa['answers'][0]["text"]
# Add the question_context to the inputs list
inputs.append(question_context)
# Add the answer to the targets list
targets.append(answer)
### END CODE HERE ###
return inputs, targets
inputs, targets = parse_squad(example_jsons)
print("Number of question/answer pairs: " + str(len(inputs)))
print('\nFirst Q/A pair:\n\ninputs: ' + colored(inputs[0], 'blue'))
print('\ntargets: ' + colored(targets[0], 'green'))
print('\nLast Q/A pair:\n\ninputs: ' + colored(inputs[-1], 'blue'))
print('\ntargets: ' + colored(targets[-1], 'green'))
Number of question/answer pairs: 86821 First Q/A pair: inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy". targets: answer: in the late 1990s Last Q/A pair: inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to promote international relations has established an International Relations Secretariat (IRC). KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States. This activity has been further enhanced by establishing formal relationships with 8 other cities: Motsumoto City of Japan, Rochester of the USA, Yangon (formerly Rangoon) of Myanmar, Xi'an of the People's Republic of China, Minsk of Belarus, and Pyongyang of the Democratic Republic of Korea. KMC's constant endeavor is to enhance its interaction with SAARC countries, other International agencies and many other major cities of the world to achieve better urban management and developmental programs for Kathmandu. targets: answer: Kathmandu Metropolitan City
Expected Output:¶
Number of question/answer pairs: 86821
First Q/A pair:
inputs: question: When did Beyonce start becoming popular? context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
targets: answer: in the late 1990s
Last Q/A pair:
inputs: question: What is KMC an initialism of? context: Kathmandu Metropolitan City (KMC), in order to promote international relations has established an International Relations Secretariat (IRC). KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States. This activity has been further enhanced by establishing formal relationships with 8 other cities: Motsumoto City of Japan, Rochester of the USA, Yangon (formerly Rangoon) of Myanmar, Xi'an of the People's Republic of China, Minsk of Belarus, and Pyongyang of the Democratic Republic of Korea. KMC's constant endeavor is to enhance its interaction with SAARC countries, other International agencies and many other major cities of the world to achieve better urban management and developmental programs for Kathmandu.
targets: answer: Kathmandu Metropolitan City
# UNIT TEST
w3_unittest.test_parse_squad(parse_squad)
All tests passed
You will use 40,000 samples for training and 5,000 samples for testing.
# 40K pairs for training
inputs_train = inputs[0:40000]
targets_train = targets[0:40000]
# 5K pairs for testing
inputs_test = inputs[40000:45000]
targets_test = targets[40000:45000]
Now, you can create the batched dataset of padded sequences. You will first tokenize the inputs and the targets. Then, using the function tf.keras.preprocessing.sequence.pad_sequences, you will ensure that the inputs and the outputs have the required lengths. Remember that sequences longer than the required size will be truncated and shorter ones will be padded with 0. This setup is very similar to the one used earlier in this notebook and in the previous one.
# Limit the size of the input and output data so this can run in this environment
encoder_maxlen = 150
decoder_maxlen = 50
inputs_str = [tokenizer.tokenize(s) for s in inputs_train]
targets_str = [tf.concat([tokenizer.tokenize(s), [1]], 0) for s in targets_train]
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs_str, maxlen=encoder_maxlen, padding='post', truncating='post')
targets = tf.keras.preprocessing.sequence.pad_sequences(targets_str, maxlen=decoder_maxlen, padding='post', truncating='post')
inputs = tf.cast(inputs, dtype=tf.int32)
targets = tf.cast(targets, dtype=tf.int32)
# Create the final training dataset.
BUFFER_SIZE = 10000
BATCH_SIZE = 64
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
3.2 - Fine-tune the T5 model¶
Now, you will train the model for 2 epochs. In the T5 model, all the weights are adjusted during fine-tuning. As usual, fine-tuning this model to get state-of-the-art results would require more time and resources than are available in this environment, but you are welcome to train the model for more epochs and with more data using Colab GPUs.
# Define the number of epochs
epochs = 2
losses = []
# Training loop
for epoch in range(epochs):
start = time.time()
train_loss.reset_states()
number_of_batches=len(list(enumerate(dataset)))
for (batch, (inp, tar)) in enumerate(dataset):
print(f'Epoch {epoch+1}, Batch {batch+1}/{number_of_batches}', end='\r')
transformer_utils.train_step(inp, tar, transformer, loss_object, optimizer, train_loss)
print (f'Epoch {epoch+1}, Loss {train_loss.result():.4f}')
losses.append(train_loss.result())
print (f'Time taken for one epoch: {time.time() - start} sec')
#if epoch % 15 == 0:
#transformer.save_weights('./pretrained_models/model_qa_temp')
# Save the final model
#transformer.save_weights('./pretrained_models/model_qa_temp')
Epoch 1, Loss 5.976225 Time taken for one epoch: 76.0023992061615 sec Epoch 2, Loss 5.351825 Time taken for one epoch: 34.14668011665344 sec
To get a model that works properly, you would need to train for about 100 epochs. So, we have pretrained a model for you. Just load the weights into the current model and let's use it to answer questions.
# Restore the weights
transformer.load_weights('./pretrained_models/model_qa3')
<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7fa5c06a0880>
3.3 - Implement your Question Answering model¶
In this final step, you will implement the answer_question function, utilizing a pre-trained transformer model for question answering.
To help you out, the transformer_utils.next_word function is provided. This function receives the question and the beginning of the answer (both in tensor format), along with the model, and predicts the next token of the answer. The next cell shows how to use it:
# Define an example question
example_question = "question: What color is the sky? context: Sky is blue"
# Question is tokenized and padded
# Note that this is hardcoded here but you must implement this in the upcoming exercise
tokenized_padded_question = tf.constant([[822, 10, 363, 945, 19, 8, 5796, 58, 2625, 10, 5643, 19, 1692, 0, 0]])
# All answers begin with the string "answer: "
# Feel free to check that this is indeed the tokenized version of that string
tokenized_answer = tf.constant([[1525, 10]])
# Predict the next word using the transformer_utils.next_word function
# Notice that it expects the question, answer and model (in that order)
next_word = transformer_utils.next_word(tokenized_padded_question, tokenized_answer, transformer)
print(f"Predicted next word is: '{tokenizer.detokenize(next_word).numpy()[0].decode('utf-8')}'")
# Concatenate predicted word with answer so far
answer_so_far = tf.concat([tokenized_answer, next_word], axis=-1)
print(f"Answer so far: '{tokenizer.detokenize(answer_so_far).numpy()[0].decode('utf-8')}'")
Predicted next word is: 'blue' Answer so far: 'answer: blue'
Exercise 3 - Implement the question answering function¶
Implement the answer_question function. Here are the steps:
Question Setup:
- Tokenize the given question using the provided tokenizer.
- Add an extra dimension to the tensor for compatibility.
- Pad the question tensor using pad_sequences to ensure the sequence has the specified max length. This function will truncate the sequence if it is longer or pad it with zeros if it is shorter.
Answer Setup:
- Tokenize the initial answer, noting that all answers begin with the string "answer: ".
- Add an extra dimension to the tensor for compatibility.
- Get the id of the EOS token, typically represented by 1.
Generate Answer:
- Loop for decoder_maxlen iterations.
- Use the transformer_utils.next_word function, which predicts the next token in the answer using the model, the input document, and the current state of the output.
- Concatenate the predicted next word to the output tensor.
Stop Condition:
- The text generation stops if the model predicts the EOS token.
- If the EOS token is predicted, break out of the loop.
# GRADED FUNCTION: answer_question
def answer_question(question, model, tokenizer, encoder_maxlen=150, decoder_maxlen=50):
"""
A function for question answering using the transformer model.
Arguments:
question (str): Input data with question and context.
model (tf.keras.Model): The transformer model.
tokenizer (SentencePieceProcessor): The SentencePiece tokenizer.
encoder_maxlen (int): Max length of the encoded sequence.
decoder_maxlen (int): Max length of the decoded sequence.
Returns:
answer (str): The answer to the question.
"""
### START CODE HERE ###
# QUESTION SETUP
# Tokenize the question
tokenized_question = tokenizer.tokenize(question)
# Add an extra dimension to the tensor
tokenized_question = tf.expand_dims(tokenized_question, 0)
# Pad the question tensor
padded_question = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question,
maxlen=encoder_maxlen,
padding='post',
truncating='post')
# ANSWER SETUP
# Initialize the answer with the token for "answer: "
tokenized_answer = tokenizer.tokenize("answer: ")
# Add an extra dimension to the tensor
tokenized_answer = tf.expand_dims(tokenized_answer, 0)
# Get the id of the EOS token
eos = tokenizer.string_to_id("</s>")
# Loop for decoder_maxlen iterations
for i in range(decoder_maxlen):
# Predict the next word using the model, the input document, and the current state of output
next_word_id = transformer_utils.next_word(padded_question, tokenized_answer, model)
# Concat the predicted next word to the output
tokenized_answer =tf.concat([tokenized_answer, next_word_id], axis=-1)
# The text generation stops if the model predicts the EOS token
if next_word_id == eos:
break
### END CODE HERE ###
return tokenized_answer
Let's test the model with some questions from the training dataset. Check whether the answers match the correct ones.
idx = 10408
result = answer_question(inputs_train[idx], transformer, tokenizer)
print(colored(pretty_decode(result, sentinels, tokenizer).numpy()[0], 'blue'))
print()
print(inputs_train[idx])
print(colored(targets_train[idx], 'green'))
b'answer: January 9, 1957' question: When was the Chechen-Ingush Autonomous Soviet Socialist Republic transferred from the Georgian SSR? context: On January 9, 1957, Karachay Autonomous Oblast and Chechen-Ingush Autonomous Soviet Socialist Republic were restored by Khrushchev and they were transferred from the Georgian SSR back to the Russian SFSR. answer: January 9, 1957
Expected Output:¶
b'answer: January 9, 1957'
question: When was the Chechen-Ingush Autonomous Soviet Socialist Republic transferred from the Georgian SSR? context: On January 9, 1957, Karachay Autonomous Oblast and Chechen-Ingush Autonomous Soviet Socialist Republic were restored by Khrushchev and they were transferred from the Georgian SSR back to the Russian SFSR.
answer: January 9, 1957
# UNIT TEST
w3_unittest.test_answer_question(answer_question)
All tests passed
Test the model with question 110
idx = 110
result = answer_question(inputs_test[idx], transformer, tokenizer)
print(colored(pretty_decode(result, sentinels, tokenizer).numpy()[0], 'blue'))
print()
print(inputs_test[idx])
print(colored(targets_test[idx], 'green'))
b'answer: 50' question: What percentage of the vote was recorded as approving Napoleon's constitution? context: Napoleon established a political system that historian Martyn Lyons called "dictatorship by plebiscite." Worried by the democratic forces unleashed by the Revolution, but unwilling to ignore them entirely, Napoleon resorted to regular electoral consultations with the French people on his road to imperial power. He drafted the Constitution of the Year VIII and secured his own election as First Consul, taking up residence at the Tuileries. The constitution was approved in a rigged plebiscite held the following January, with 99.94 percent officially listed as voting "yes." Napoleon's brother, Lucien, had falsified the returns to show that 3 million people had participated in the plebiscite; the real number was 1.5 million. Political observers at the time assumed the eligible French voting public numbered about 5 million people, so the regime artificially doubled the participation rate to indicate popular enthusiasm for the Consulate. In the first few months of the Consulate, with war in Europe still raging and internal instability still plaguing the country, Napoleon's grip on power remained very tenuous. answer: 99.94
idx = 301
result = answer_question(inputs_test[idx], transformer, tokenizer)
print(colored(pretty_decode(result, sentinels, tokenizer).numpy()[0], 'blue'))
print()
print(inputs_test[idx])
print(colored(targets_test[idx], 'green'))
b'answer: 8th Street' question: On what date was Napoleon contacted by the 5th Regiment? context: The 5th Regiment was sent to intercept him and made contact just south of Grenoble on March 7, 1815. Napoleon approached the regiment alone, dismounted his horse and, when he was within gunshot range, shouted to the soldiers, "Here I am. Kill your Emperor, if you wish." The soldiers quickly responded with, "Vive L'Empereur!" Ney, who had boasted to the restored Bourbon king, Louis XVIII, that he would bring Napoleon to Paris in an iron cage, affectionately kissed his former emperor and forgot his oath of allegiance to the Bourbon monarch. The two then marched together towards Paris with a growing army. The unpopular Louis XVIII fled to Belgium after realizing he had little political support. On March 13, the powers at the Congress of Vienna declared Napoleon an outlaw. Four days later, Great Britain, Russia, Austria, and Prussia each pledged to put 150,000 men into the field to end his rule. answer: March 7, 1815
Test the model with question 311. Use this cell to play with the model by selecting other test questions. Check whether the model has learned something or is just generating random text.
idx = 311
result = answer_question(inputs_test[idx], transformer, tokenizer)
print(colored(pretty_decode(result, sentinels, tokenizer).numpy()[0], 'blue'))
print()
print(inputs_test[idx])
print(colored(targets_test[idx], 'green'))
b'answer: June 1840' question: On what date was a state funeral held for Napoleon? context: In 1840, Louis Philippe I obtained permission from the British to return Napoleon's remains to France. On 15 December 1840, a state funeral was held. The hearse proceeded from the Arc de Triomphe down the Champs-Élysées, across the Place de la Concorde to the Esplanade des Invalides and then to the cupola in St Jérôme's Chapel, where it remained until the tomb designed by Louis Visconti was completed. In 1861, Napoleon's remains were entombed in a porphyry sarcophagus in the crypt under the dome at Les Invalides. answer: 15 December 1840
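If you want a rough quantitative feel for the model beyond individual examples, the sketch below computes a simple exact-match score over a small slice of the test pairs. It is only illustrative: decoding is slow, exact match is a strict metric, and the sample size here is tiny.
# Rough exact-match evaluation over a few test examples (illustrative only).
# Decoding one answer at a time is slow, so keep num_samples small.
num_samples = 20
matches = 0
for i in range(num_samples):
    prediction = answer_question(inputs_test[i], transformer, tokenizer)
    predicted_text = pretty_decode(prediction, sentinels, tokenizer).numpy()[0].decode('utf-8')
    # Strip the leading "answer: " prefix before comparing
    predicted_text = predicted_text.replace('answer: ', '').strip()
    reference_text = targets_test[i].replace('answer: ', '').strip()
    if predicted_text == reference_text:
        matches += 1
print(f"Exact match: {matches}/{num_samples}")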
Congratulations, you have finished the last assignment of this specialization. Now you know what is behind powerful models like ChatGPT. It is time for you to find and solve the huge range of problems that can be approached with NLP.