{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "692f3256",
   "metadata": {},
   "source": [
    "# Evaluating an OpenAPI Chain\n",
    "\n",
    "This notebook goes over ways to semantically evaluate an [OpenAPI Chain](openapi.ipynb), which calls an endpoint defined by the OpenAPI specification using purely natural language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a457106d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.tools import OpenAPISpec, APIOperation\n",
    "from langchain.chains import OpenAPIEndpointChain, LLMChain\n",
    "from langchain.requests import Requests\n",
    "from langchain.llms import OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c3b0954",
   "metadata": {},
   "source": [
    "## Load the API Chain\n",
    "\n",
    "Load a wrapper of the spec (so we can work with it more easily). You can load from a url or from a local file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "794142ba",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
     ]
    }
   ],
   "source": [
    "# Load and parse the OpenAPI Spec\n",
    "spec = OpenAPISpec.from_url(\"https://www.klarna.com/us/shopping/public/openai/v0/api-docs/\")\n",
    "# Load a single endpoint operation\n",
    "operation = APIOperation.from_openapi_spec(spec, '/public/openai/v0/products', \"get\")\n",
    "verbose = False\n",
    "# Select any LangChain LLM\n",
    "llm = OpenAI(temperature=0, max_tokens=1000)\n",
    "# Create the endpoint chain\n",
    "api_chain = OpenAPIEndpointChain.from_api_operation(\n",
    "    operation, \n",
    "    llm, \n",
    "    requests=Requests(), \n",
    "    verbose=verbose,\n",
    "    return_intermediate_steps=True # Return request and response text\n",
    ")"
   ]
  },
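  {
   "cell_type": "markdown",
   "id": "3fa1b2c4",
   "metadata": {},
   "source": [
    "If you've saved the spec locally, you can load it from disk instead of a URL. The cell below is a minimal, commented-out sketch; `openapi.yaml` is a placeholder path for wherever you saved the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c0e91ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: load the same spec from a local file instead of a URL.\n",
    "# The path below is a placeholder - point it at your downloaded spec.\n",
    "# spec = OpenAPISpec.from_file(\"openapi.yaml\")"
   ]
  },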
  {
   "cell_type": "markdown",
   "id": "6c05ba5b",
   "metadata": {},
   "source": [
    "### *Optional*: Generate Input Questions and Request Ground Truth Queries\n",
    "\n",
    "See [Generating Test Datasets](#Generating-Test-Datasets) at the end of this notebook for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a0c0cb7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import re\n",
    "# from langchain.prompts import PromptTemplate\n",
    "\n",
    "# template = \"\"\"Below is a service description:\n",
    "\n",
    "# {spec}\n",
    "\n",
    "# Imagine you're a new user trying to use {operation} through a search bar. What are 10 different things you want to request?\n",
    "# Wants/Questions:\n",
    "# 1. \"\"\"\n",
    "\n",
    "# prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "# generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
    "\n",
    "# questions_ = generation_chain.run(spec=operation.to_typescript(), operation=operation.operation_id).split('\\n')\n",
    "# # Strip preceding numeric bullets\n",
    "# questions = [re.sub(r'^\\d+\\. ', '', q).strip() for q in questions_]\n",
    "# questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f3d767ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ground_truths = [\n",
    "# {\"q\": ...} # What are the best queries for each input?\n",
    "# ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81098a05",
   "metadata": {},
   "source": [
    "## Run the API Chain\n",
    "\n",
    "The two simplest questions a user of the API Chain are:\n",
    "- Did the chain succesfully access the endpoint?\n",
    "- Did the action accomplish the correct result?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "64bc7ed9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "# Collect metrics to report at completion\n",
    "scores = defaultdict(list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "dfd2d09f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--openapi-chain-klarna-products-get-5d03362007667626/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "10932c9c139941d1a8be1a798f29e923",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langchain.evaluation.loading import load_dataset\n",
    "dataset = load_dataset(\"openapi-chain-klarna-products-get\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e08191a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'question': 'What iPhone models are available?',\n",
       "  'expected_query': {'max_price': None, 'q': 'iPhone'}},\n",
       " {'question': 'Are there any budget laptops?',\n",
       "  'expected_query': {'max_price': 300, 'q': 'laptop'}},\n",
       " {'question': 'Show me the cheapest gaming PC.',\n",
       "  'expected_query': {'max_price': 500, 'q': 'gaming pc'}},\n",
       " {'question': 'Are there any tablets under $400?',\n",
       "  'expected_query': {'max_price': 400, 'q': 'tablet'}},\n",
       " {'question': 'What are the best headphones?',\n",
       "  'expected_query': {'max_price': None, 'q': 'headphones'}},\n",
       " {'question': 'What are the top rated laptops?',\n",
       "  'expected_query': {'max_price': None, 'q': 'laptop'}},\n",
       " {'question': 'I want to buy some shoes. I like Adidas and Nike.',\n",
       "  'expected_query': {'max_price': None, 'q': 'shoe'}},\n",
       " {'question': 'I want to buy a new skirt',\n",
       "  'expected_query': {'max_price': None, 'q': 'skirt'}},\n",
       " {'question': 'My company is asking me to get a professional Deskopt PC - money is no object.',\n",
       "  'expected_query': {'max_price': 10000, 'q': 'professional desktop PC'}},\n",
       " {'question': 'What are the best budget cameras?',\n",
       "  'expected_query': {'max_price': 300, 'q': 'camera'}}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7ee71384",
   "metadata": {},
   "outputs": [],
   "source": [
    "questions = [d['question'] for d in dataset]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "00511f7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Run the the API chain itself\n",
    "raise_error = False # Stop on first failed example - useful for development\n",
    "chain_outputs = []\n",
    "failed_examples = []\n",
    "for question in questions:\n",
    "    try:\n",
    "        chain_outputs.append(api_chain(question))\n",
    "        scores[\"completed\"].append(1.0)\n",
    "    except Exception as e:\n",
    "        if raise_error:\n",
    "            raise e\n",
    "        failed_examples.append({'q': question, 'error': e})\n",
    "        scores[\"completed\"].append(0.0)"
   ]
  },
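  {
   "cell_type": "markdown",
   "id": "b81d22e0",
   "metadata": {},
   "source": [
    "Because we passed `return_intermediate_steps=True`, each chain output also records the synthesized request (`request_args`) and the raw API response (`response_text`), which the evaluations below rely on. You can peek at one output to confirm:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4a9f310",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the intermediate steps recorded for the first example\n",
    "chain_outputs[0][\"intermediate_steps\"].keys()"
   ]
  },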
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "f3c9729f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If the chain failed to run, show the failing examples\n",
    "failed_examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "914e7587",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['There are currently 10 Apple iPhone models available: Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.',\n",
       " 'Yes, there are several budget laptops in the API response. For example, the HP 14-dq0055dx and HP 15-dw0083wm are both priced at $199.99 and $244.99 respectively.',\n",
       " 'The cheapest gaming PC available is the Alarco Gaming PC (X_BLACK_GTX750) for $499.99. You can find more information about it here: https://www.klarna.com/us/shopping/pl/cl223/3203154750/Desktop-Computers/Alarco-Gaming-PC-%28X_BLACK_GTX750%29/?utm_source=openai&ref-site=openai_plugin',\n",
       " 'Yes, there are several tablets under $400. These include the Apple iPad 10.2\" 32GB (2019), Samsung Galaxy Tab A8 10.5 SM-X200 32GB, Samsung Galaxy Tab A7 Lite 8.7 SM-T220 32GB, Amazon Fire HD 8\" 32GB (10th Generation), and Amazon Fire HD 10 32GB.',\n",
       " 'It looks like you are looking for the best headphones. Based on the API response, it looks like the Apple AirPods Pro (2nd generation) 2022, Apple AirPods Max, and Bose Noise Cancelling Headphones 700 are the best options.',\n",
       " 'The top rated laptops based on the API response are the Apple MacBook Pro (2021) M1 Pro 8C CPU 14C GPU 16GB 512GB SSD 14\", Apple MacBook Pro (2022) M2 OC 10C GPU 8GB 256GB SSD 13.3\", Apple MacBook Air (2022) M2 OC 8C GPU 8GB 256GB SSD 13.6\", and Apple MacBook Pro (2023) M2 Pro OC 16C GPU 16GB 512GB SSD 14.2\".',\n",
       " \"I found several Nike and Adidas shoes in the API response. Here are the links to the products: Nike Dunk Low M - Black/White: https://www.klarna.com/us/shopping/pl/cl337/3200177969/Shoes/Nike-Dunk-Low-M-Black-White/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 4 Retro M - Midnight Navy: https://www.klarna.com/us/shopping/pl/cl337/3202929835/Shoes/Nike-Air-Jordan-4-Retro-M-Midnight-Navy/?utm_source=openai&ref-site=openai_plugin, Nike Air Force 1 '07 M - White: https://www.klarna.com/us/shopping/pl/cl337/3979297/Shoes/Nike-Air-Force-1-07-M-White/?utm_source=openai&ref-site=openai_plugin, Nike Dunk Low W - White/Black: https://www.klarna.com/us/shopping/pl/cl337/3200134705/Shoes/Nike-Dunk-Low-W-White-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 1 Retro High M - White/University Blue/Black: https://www.klarna.com/us/shopping/pl/cl337/3200383658/Shoes/Nike-Air-Jordan-1-Retro-High-M-White-University-Blue-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 1 Retro High OG M - True Blue/Cement Grey/White: https://www.klarna.com/us/shopping/pl/cl337/3204655673/Shoes/Nike-Air-Jordan-1-Retro-High-OG-M-True-Blue-Cement-Grey-White/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 11 Retro Cherry - White/Varsity Red/Black: https://www.klarna.com/us/shopping/pl/cl337/3202929696/Shoes/Nike-Air-Jordan-11-Retro-Cherry-White-Varsity-Red-Black/?utm_source=openai&ref-site=openai_plugin, Nike Dunk High W - White/Black: https://www.klarna.com/us/shopping/pl/cl337/3201956448/Shoes/Nike-Dunk-High-W-White-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 5 Retro M - Black/Taxi/Aquatone: https://www.klarna.com/us/shopping/pl/cl337/3204923084/Shoes/Nike-Air-Jordan-5-Retro-M-Black-Taxi-Aquatone/?utm_source=openai&ref-site=openai_plugin, Nike Court Legacy Lift W: https://www.klarna.com/us/shopping/pl/cl337/3202103728/Shoes/Nike-Court-Legacy-Lift-W/?utm_source=openai&ref-site=openai_plugin\",\n",
       " \"I found several skirts that may interest you. Please take a look at the following products: Avenue Plus Size Denim Stretch Skirt, LoveShackFancy Ruffled Mini Skirt - Antique White, Nike Dri-Fit Club Golf Skirt - Active Pink, Skims Soft Lounge Ruched Long Skirt, French Toast Girl's Front Pleated Skirt with Tabs, Alexia Admor Women's Harmonie Mini Skirt Pink Pink, Vero Moda Long Skirt, Nike Court Dri-FIT Victory Flouncy Tennis Skirt Women - White/Black, Haoyuan Mini Pleated Skirts W, and Zimmermann Lyre Midi Skirt.\",\n",
       " 'Based on the API response, you may want to consider the Skytech Archangel Gaming Computer PC Desktop, the CyberPowerPC Gamer Master Gaming Desktop, or the ASUS ROG Strix G10DK-RS756, as they all offer powerful processors and plenty of RAM.',\n",
       " 'Based on the API response, the best budget cameras are the DJI Mini 2 Dog Camera ($448.50), Insta360 Sphere with Landing Pad ($429.99), DJI FPV Gimbal Camera ($121.06), Parrot Camera & Body ($36.19), and DJI FPV Air Unit ($179.00).']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answers = [res['output'] for res in chain_outputs]\n",
    "answers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "484f0587",
   "metadata": {},
   "source": [
    "## Evaluate the requests chain\n",
    "\n",
    "The API Chain has two main components:\n",
    "1. Translate the user query to an API request (request synthesizer)\n",
    "2. Translate the API response to a natural language response\n",
    "\n",
    "Here, we construct an evaluation chain to grade the request synthesizer against selected human queries "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3ea5afd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "truth_queries = [json.dumps(data[\"expected_query\"]) for data in dataset]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "e055f24b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect the API queries generated by the chain\n",
    "predicted_queries = [output[\"intermediate_steps\"][\"request_args\"] for output in chain_outputs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7d4f2b88",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "template = \"\"\"You are trying to answer the following question by querying an API:\n",
    "\n",
    "> Question: {question}\n",
    "\n",
    "The query you know you should be executing against the API is:\n",
    "\n",
    "> Query: {truth_query}\n",
    "\n",
    "Is the following predicted query semantically the same (eg likely to produce the same answer)?\n",
    "\n",
    "> Predicted Query: {predict_query}\n",
    "\n",
    "Please give the Predicted Query a grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'\n",
    "\n",
    "> Explanation: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8cc1b1db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"max_price\" parameter is also correct, as it is set to null, meaning that no maximum price is set. The predicted query adds two additional parameters, \"size\" and \"min_price\". The \"size\" parameter is not necessary, as it is not relevant to the question being asked. The \"min_price\" parameter is also not necessary, as it is not relevant to the question being asked and it is set to 0, which is the default value. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
       " ' The original query is asking for laptops with a maximum price of 300. The predicted query is asking for laptops with a minimum price of 0 and a maximum price of 500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. Therefore, the predicted query is not semantically the same as the original query, and it is not likely to produce the same answer. Final Grade: F',\n",
       " \" The first two parameters are the same, so that's good. The third parameter is different, but it's not necessary for the query, so that's not a problem. The fourth parameter is the problem. The original query specifies a maximum price of 500, while the predicted query specifies a maximum price of null. This means that the predicted query will not limit the results to the cheapest gaming PCs, so it is not semantically the same as the original query. Final Grade: F\",\n",
       " ' The original query is asking for tablets under $400, so the first two parameters are correct. The predicted query also includes the parameters \"size\" and \"min_price\", which are not necessary for the original query. The \"size\" parameter is not relevant to the question, and the \"min_price\" parameter is redundant since the original query already specifies a maximum price. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
       " ' The original query is asking for headphones with no maximum price, so the predicted query is not semantically the same because it has a maximum price of 500. The predicted query also has a size of 10, which is not specified in the original query. Therefore, the predicted query is not semantically the same as the original query. Final Grade: F',\n",
       " \" The original query is asking for the top rated laptops, so the 'size' parameter should be set to 10 to get the top 10 results. The 'min_price' parameter should be set to 0 to get results from all price ranges. The 'max_price' parameter should be set to null to get results from all price ranges. The 'q' parameter should be set to 'laptop' to get results related to laptops. All of these parameters are present in the predicted query, so it is semantically the same as the original query. Final Grade: A\",\n",
       " ' The original query is asking for shoes, so the predicted query is asking for the same thing. The original query does not specify a size, so the predicted query is not adding any additional information. The original query does not specify a price range, so the predicted query is adding additional information that is not necessary. Therefore, the predicted query is not semantically the same as the original query and is likely to produce different results. Final Grade: D',\n",
       " ' The original query is asking for a skirt, so the predicted query is asking for the same thing. The predicted query also adds additional parameters such as size and price range, which could help narrow down the results. However, the size parameter is not necessary for the query to be successful, and the price range is too narrow. Therefore, the predicted query is not as effective as the original query. Final Grade: C',\n",
       " ' The first part of the query is asking for a Desktop PC, which is the same as the original query. The second part of the query is asking for a size of 10, which is not relevant to the original query. The third part of the query is asking for a minimum price of 0, which is not relevant to the original query. The fourth part of the query is asking for a maximum price of null, which is not relevant to the original query. Therefore, the Predicted Query does not semantically match the original query and is not likely to produce the same answer. Final Grade: F',\n",
       " ' The original query is asking for cameras with a maximum price of 300. The predicted query is asking for cameras with a maximum price of 500. This means that the predicted query is likely to return more results than the original query, which may include cameras that are not within the budget range. Therefore, the predicted query is not semantically the same as the original query and does not answer the original question. Final Grade: F']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "request_eval_results = []\n",
    "for question, predict_query, truth_query in list(zip(questions, predicted_queries, truth_queries)):\n",
    "    eval_output = eval_chain.run(\n",
    "        question=question,\n",
    "        truth_query=truth_query,\n",
    "        predict_query=predict_query,\n",
    "    )\n",
    "    request_eval_results.append(eval_output)\n",
    "request_eval_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "0d76f8ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import List\n",
    "# Parse the evaluation chain responses into a rubric\n",
    "def parse_eval_results(results: List[str]) -> List[float]:\n",
    "    rubric = {\n",
    "        \"A\": 1.0,\n",
    "        \"B\": 0.75,\n",
    "        \"C\": 0.5,\n",
    "        \"D\": 0.25,\n",
    "        \"F\": 0\n",
    "    }\n",
    "    return [rubric[re.search(r'Final Grade: (\\w+)', res).group(1)] for res in results]\n",
    "\n",
    "\n",
    "parsed_results = parse_eval_results(request_eval_results)\n",
    "# Collect the scores for a final evaluation table\n",
    "scores['request_synthesizer'].extend(parsed_results)"
   ]
  },
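  {
   "cell_type": "markdown",
   "id": "d25e6b77",
   "metadata": {},
   "source": [
    "*Optional*: pair each question with its parsed grade to see which inputs the request synthesizer struggled with. This is just a convenience view over the `questions` and `parsed_results` defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9f04c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by score (worst first) to surface the weakest request syntheses\n",
    "for question, score in sorted(zip(questions, parsed_results), key=lambda pair: pair[1]):\n",
    "    print(f\"{score:.2f}\\t{question}\")"
   ]
  },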
  {
   "cell_type": "markdown",
   "id": "6f3ee8ea",
   "metadata": {},
   "source": [
    "## Evaluate the Response Chain\n",
    "\n",
    "The second component translated the structured API response to a natural language response.\n",
    "Evaluate this against the user's original question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "8b97847c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "template = \"\"\"You are trying to answer the following question by querying an API:\n",
    "\n",
    "> Question: {question}\n",
    "\n",
    "The API returned a response of:\n",
    "\n",
    "> API result: {api_response}\n",
    "\n",
    "Your response to the user: {answer}\n",
    "\n",
    "Please evaluate the accuracy and utility of your response to the user's original question, conditioned on the information available.\n",
    "Give a letter grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'\n",
    "\n",
    "> Explanation: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "642852ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the API responses from the chain\n",
    "api_responses = [output[\"intermediate_steps\"][\"response_text\"] for output in chain_outputs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "08a5eb4f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"max_price\" parameter is also correct, as it is set to null, meaning that no maximum price is set. The predicted query adds two additional parameters, \"size\" and \"min_price\". The \"size\" parameter is not necessary, as it is not relevant to the question being asked. The \"min_price\" parameter is also not necessary, as it is not relevant to the question being asked and it is set to 0, which is the default value. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
       " ' The original query is asking for laptops with a maximum price of 300. The predicted query is asking for laptops with a minimum price of 0 and a maximum price of 500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. Therefore, the predicted query is not semantically the same as the original query, and it is not likely to produce the same answer. Final Grade: F',\n",
       " \" The first two parameters are the same, so that's good. The third parameter is different, but it's not necessary for the query, so that's not a problem. The fourth parameter is the problem. The original query specifies a maximum price of 500, while the predicted query specifies a maximum price of null. This means that the predicted query will not limit the results to the cheapest gaming PCs, so it is not semantically the same as the original query. Final Grade: F\",\n",
       " ' The original query is asking for tablets under $400, so the first two parameters are correct. The predicted query also includes the parameters \"size\" and \"min_price\", which are not necessary for the original query. The \"size\" parameter is not relevant to the question, and the \"min_price\" parameter is redundant since the original query already specifies a maximum price. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
       " ' The original query is asking for headphones with no maximum price, so the predicted query is not semantically the same because it has a maximum price of 500. The predicted query also has a size of 10, which is not specified in the original query. Therefore, the predicted query is not semantically the same as the original query. Final Grade: F',\n",
       " \" The original query is asking for the top rated laptops, so the 'size' parameter should be set to 10 to get the top 10 results. The 'min_price' parameter should be set to 0 to get results from all price ranges. The 'max_price' parameter should be set to null to get results from all price ranges. The 'q' parameter should be set to 'laptop' to get results related to laptops. All of these parameters are present in the predicted query, so it is semantically the same as the original query. Final Grade: A\",\n",
       " ' The original query is asking for shoes, so the predicted query is asking for the same thing. The original query does not specify a size, so the predicted query is not adding any additional information. The original query does not specify a price range, so the predicted query is adding additional information that is not necessary. Therefore, the predicted query is not semantically the same as the original query and is likely to produce different results. Final Grade: D',\n",
       " ' The original query is asking for a skirt, so the predicted query is asking for the same thing. The predicted query also adds additional parameters such as size and price range, which could help narrow down the results. However, the size parameter is not necessary for the query to be successful, and the price range is too narrow. Therefore, the predicted query is not as effective as the original query. Final Grade: C',\n",
       " ' The first part of the query is asking for a Desktop PC, which is the same as the original query. The second part of the query is asking for a size of 10, which is not relevant to the original query. The third part of the query is asking for a minimum price of 0, which is not relevant to the original query. The fourth part of the query is asking for a maximum price of null, which is not relevant to the original query. Therefore, the Predicted Query does not semantically match the original query and is not likely to produce the same answer. Final Grade: F',\n",
       " ' The original query is asking for cameras with a maximum price of 300. The predicted query is asking for cameras with a maximum price of 500. This means that the predicted query is likely to return more results than the original query, which may include cameras that are not within the budget range. Therefore, the predicted query is not semantically the same as the original query and does not answer the original question. Final Grade: F',\n",
       " ' The user asked a question about what iPhone models are available, and the API returned a response with 10 different models. The response provided by the user accurately listed all 10 models, so the accuracy of the response is A+. The utility of the response is also A+ since the user was able to get the exact information they were looking for. Final Grade: A+',\n",
       " \" The API response provided a list of laptops with their prices and attributes. The user asked if there were any budget laptops, and the response provided a list of laptops that are all priced under $500. Therefore, the response was accurate and useful in answering the user's question. Final Grade: A\",\n",
       " \" The API response provided the name, price, and URL of the product, which is exactly what the user asked for. The response also provided additional information about the product's attributes, which is useful for the user to make an informed decision. Therefore, the response is accurate and useful. Final Grade: A\",\n",
       " \" The API response provided a list of tablets that are under $400. The response accurately answered the user's question. Additionally, the response provided useful information such as the product name, price, and attributes. Therefore, the response was accurate and useful. Final Grade: A\",\n",
       " \" The API response provided a list of headphones with their respective prices and attributes. The user asked for the best headphones, so the response should include the best headphones based on the criteria provided. The response provided a list of headphones that are all from the same brand (Apple) and all have the same type of headphone (True Wireless, In-Ear). This does not provide the user with enough information to make an informed decision about which headphones are the best. Therefore, the response does not accurately answer the user's question. Final Grade: F\",\n",
       " ' The API response provided a list of laptops with their attributes, which is exactly what the user asked for. The response provided a comprehensive list of the top rated laptops, which is what the user was looking for. The response was accurate and useful, providing the user with the information they needed. Final Grade: A',\n",
       " ' The API response provided a list of shoes from both Adidas and Nike, which is exactly what the user asked for. The response also included the product name, price, and attributes for each shoe, which is useful information for the user to make an informed decision. The response also included links to the products, which is helpful for the user to purchase the shoes. Therefore, the response was accurate and useful. Final Grade: A',\n",
       " \" The API response provided a list of skirts that could potentially meet the user's needs. The response also included the name, price, and attributes of each skirt. This is a great start, as it provides the user with a variety of options to choose from. However, the response does not provide any images of the skirts, which would have been helpful for the user to make a decision. Additionally, the response does not provide any information about the availability of the skirts, which could be important for the user. \\n\\nFinal Grade: B\",\n",
       " ' The user asked for a professional desktop PC with no budget constraints. The API response provided a list of products that fit the criteria, including the Skytech Archangel Gaming Computer PC Desktop, the CyberPowerPC Gamer Master Gaming Desktop, and the ASUS ROG Strix G10DK-RS756. The response accurately suggested these three products as they all offer powerful processors and plenty of RAM. Therefore, the response is accurate and useful. Final Grade: A',\n",
       " \" The API response provided a list of cameras with their prices, which is exactly what the user asked for. The response also included additional information such as features and memory cards, which is not necessary for the user's question but could be useful for further research. The response was accurate and provided the user with the information they needed. Final Grade: A\"]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run the grader chain\n",
    "response_eval_results = []\n",
    "for question, api_response, answer in list(zip(questions, api_responses, answers)):\n",
    "    request_eval_results.append(eval_chain.run(question=question, api_response=api_response, answer=answer))\n",
    "request_eval_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "a144aa9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reusing the rubric from above, parse the evaluation chain responses\n",
    "parsed_response_results = parse_eval_results(request_eval_results)\n",
    "# Collect the scores for a final evaluation table\n",
    "scores['result_synthesizer'].extend(parsed_response_results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "e95042bc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Metric              \tMin       \tMean      \tMax       \n",
      "completed           \t1.00      \t1.00      \t1.00      \n",
      "request_synthesizer \t0.00      \t0.23      \t1.00      \n",
      "result_synthesizer  \t0.00      \t0.55      \t1.00      \n"
     ]
    }
   ],
   "source": [
    "# Print out Score statistics for the evaluation session\n",
    "header = \"{:<20}\\t{:<10}\\t{:<10}\\t{:<10}\".format(\"Metric\", \"Min\", \"Mean\", \"Max\")\n",
    "print(header)\n",
    "for metric, metric_scores in scores.items():\n",
    "    mean_scores = sum(metric_scores) / len(metric_scores) if len(metric_scores) > 0 else float('nan')\n",
    "    row = \"{:<20}\\t{:<10.2f}\\t{:<10.2f}\\t{:<10.2f}\".format(metric, min(metric_scores), mean_scores, max(metric_scores))\n",
    "    print(row)\n"
   ]
  },
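  {
   "cell_type": "markdown",
   "id": "f3b7a2c9",
   "metadata": {},
   "source": [
    "If you have pandas installed, an equivalent summary is one line (commented out here to keep the notebook dependency-free). Wrapping each metric in a `Series` tolerates metrics with different numbers of scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1c5d7e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: the same summary via pandas, if it's installed\n",
    "# import pandas as pd\n",
    "# pd.DataFrame({metric: pd.Series(vals) for metric, vals in scores.items()}).describe()"
   ]
  },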
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "03fe96af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Re-show the examples for which the chain failed to complete\n",
    "failed_examples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bb3636d",
   "metadata": {},
   "source": [
    "## Generating Test Datasets\n",
    "\n",
    "To evaluate a chain against your own endpoint, you'll want to generate a test dataset that's conforms to the API.\n",
    "\n",
    "This section provides an overview of how to bootstrap the process.\n",
    "\n",
    "First, we'll parse the OpenAPI Spec. For this example, we'll [Speak](https://www.speak.com/)'s OpenAPI specification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "a453eb93",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n",
      "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
     ]
    }
   ],
   "source": [
    "# Load and parse the OpenAPI Spec\n",
    "spec = OpenAPISpec.from_url(\"https://api.speak.com/openapi.yaml\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "bb65ffe8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/v1/public/openai/explain-phrase',\n",
       " '/v1/public/openai/explain-task',\n",
       " '/v1/public/openai/translate']"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List the paths in the OpenAPI Spec\n",
    "paths = sorted(spec.paths.keys())\n",
    "paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "0988f01b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['post']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# See which HTTP Methods are available for a given path\n",
    "methods = spec.get_methods_for_path('/v1/public/openai/explain-task')\n",
    "methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "e9ef0a77",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type explainTask = (_: {\n",
      "/* Description of the task that the user wants to accomplish or do. For example, \"tell the waiter they messed up my order\" or \"compliment someone on their shirt\" */\n",
      "  task_description?: string,\n",
      "/* The foreign language that the user is learning and asking about. The value can be inferred from question - for example, if the user asks \"how do i ask a girl out in mexico city\", the value should be \"Spanish\" because of Mexico City. Always use the full name of the language (e.g. Spanish, French). */\n",
      "  learning_language?: string,\n",
      "/* The user's native language. Infer this value from the language the user asked their question in. Always use the full name of the language (e.g. Spanish, French). */\n",
      "  native_language?: string,\n",
      "/* A description of any additional context in the user's question that could affect the explanation - e.g. setting, scenario, situation, tone, speaking style and formality, usage notes, or any other qualifiers. */\n",
      "  additional_context?: string,\n",
      "/* Full text of the user's question. */\n",
      "  full_query?: string,\n",
      "}) => any;\n"
     ]
    }
   ],
   "source": [
    "# Load a single endpoint operation\n",
    "operation = APIOperation.from_openapi_spec(spec, '/v1/public/openai/explain-task', 'post')\n",
    "\n",
    "# The operation can be serialized as typescript\n",
    "print(operation.to_typescript())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "f1186b6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compress the service definition to avoid leaking too much input structure to the sample data\n",
    "template = \"\"\"In 20 words or less, what does this service accomplish?\n",
    "{spec}\n",
    "\n",
    "Function: It's designed to \"\"\"\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
    "purpose = generation_chain.run(spec=operation.to_typescript())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "a594406a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Can you explain how to say 'hello' in Spanish?\",\n",
       " \"I need help understanding the French word for 'goodbye'.\",\n",
       " \"Can you tell me how to say 'thank you' in German?\",\n",
       " \"I'm trying to learn the Italian word for 'please'.\",\n",
       " \"Can you help me with the pronunciation of 'yes' in Portuguese?\",\n",
       " \"I'm looking for the Dutch word for 'no'.\",\n",
       " \"Can you explain the meaning of 'hello' in Japanese?\",\n",
       " \"I need help understanding the Russian word for 'thank you'.\",\n",
       " \"Can you tell me how to say 'goodbye' in Chinese?\",\n",
       " \"I'm trying to learn the Arabic word for 'please'.\"]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "template = \"\"\"Write a list of {num_to_generate} unique messages users might send to a service designed to{purpose} They must each be completely unique.\n",
    "\n",
    "1.\"\"\"\n",
    "def parse_list(text: str) -> List[str]:\n",
    "    # Match lines starting with a number then period\n",
    "    # Strip leading and trailing whitespace\n",
    "    matches = re.findall(r'^\\d+\\. ', text)\n",
    "    return [re.sub(r'^\\d+\\. ', '', q).strip().strip('\"') for q in text.split('\\n')]\n",
    "\n",
    "num_to_generate = 10 # How many examples to use for this test set.\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
    "text = generation_chain.run(purpose=purpose,\n",
    "                            num_to_generate=num_to_generate)\n",
    "# Strip preceding numeric bullets\n",
    "queries = parse_list(text)\n",
    "queries\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "8dc60f43",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['{\"task_description\": \"say \\'hello\\'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say \\'hello\\' in Spanish?\"}',\n",
       " '{\"task_description\": \"understanding the French word for \\'goodbye\\'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for \\'goodbye\\'.\"}',\n",
       " '{\"task_description\": \"say \\'thank you\\'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'thank you\\' in German?\"}',\n",
       " '{\"task_description\": \"Learn the Italian word for \\'please\\'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Italian word for \\'please\\'.\"}',\n",
       " '{\"task_description\": \"Help with pronunciation of \\'yes\\' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of \\'yes\\' in Portuguese?\"}',\n",
       " '{\"task_description\": \"Find the Dutch word for \\'no\\'\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I\\'m looking for the Dutch word for \\'no\\'.\"}',\n",
       " '{\"task_description\": \"Explain the meaning of \\'hello\\' in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you explain the meaning of \\'hello\\' in Japanese?\"}',\n",
       " '{\"task_description\": \"understanding the Russian word for \\'thank you\\'\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for \\'thank you\\'.\"}',\n",
       " '{\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'goodbye\\' in Chinese?\"}',\n",
       " '{\"task_description\": \"Learn the Arabic word for \\'please\\'\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Arabic word for \\'please\\'.\"}']"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define the generation chain to get hypotheses\n",
    "api_chain = OpenAPIEndpointChain.from_api_operation(\n",
    "    operation, \n",
    "    llm, \n",
    "    requests=Requests(), \n",
    "    verbose=verbose,\n",
    "    return_intermediate_steps=True # Return request and response text\n",
    ")\n",
    "\n",
    "predicted_outputs =[api_chain(query) for query in queries]\n",
    "request_args = [output[\"intermediate_steps\"][\"request_args\"] for output in predicted_outputs]\n",
    "\n",
    "# Show the generated request\n",
    "request_args"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "b727e28e",
   "metadata": {},
   "outputs": [],
   "source": [
    "## AI Assisted Correction\n",
    "correction_template = \"\"\"Correct the following API request based on the user's feedback. If the user indicates no changes are needed, output the original without making any changes.\n",
    "\n",
    "REQUEST: {request}\n",
    "\n",
    "User Feedback / requested changes: {user_feedback}\n",
    "\n",
    "Finalized Request: \"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(correction_template)\n",
    "correction_chain = LLMChain(llm=llm, prompt=prompt)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "c1f4d71f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: Can you explain how to say 'hello' in Spanish?\n",
      "Request: {\"task_description\": \"say 'hello'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say 'hello' in Spanish?\"}\n",
      "Requested changes: \n",
      "Query: I need help understanding the French word for 'goodbye'.\n",
      "Request: {\"task_description\": \"understanding the French word for 'goodbye'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for 'goodbye'.\"}\n",
      "Requested changes: \n",
      "Query: Can you tell me how to say 'thank you' in German?\n",
      "Request: {\"task_description\": \"say 'thank you'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say 'thank you' in German?\"}\n",
      "Requested changes: \n",
      "Query: I'm trying to learn the Italian word for 'please'.\n",
      "Request: {\"task_description\": \"Learn the Italian word for 'please'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I'm trying to learn the Italian word for 'please'.\"}\n",
      "Requested changes: \n",
      "Query: Can you help me with the pronunciation of 'yes' in Portuguese?\n",
      "Request: {\"task_description\": \"Help with pronunciation of 'yes' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of 'yes' in Portuguese?\"}\n",
      "Requested changes: \n",
      "Query: I'm looking for the Dutch word for 'no'.\n",
      "Request: {\"task_description\": \"Find the Dutch word for 'no'\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I'm looking for the Dutch word for 'no'.\"}\n",
      "Requested changes: \n",
      "Query: Can you explain the meaning of 'hello' in Japanese?\n",
      "Request: {\"task_description\": \"Explain the meaning of 'hello' in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you explain the meaning of 'hello' in Japanese?\"}\n",
      "Requested changes: \n",
      "Query: I need help understanding the Russian word for 'thank you'.\n",
      "Request: {\"task_description\": \"understanding the Russian word for 'thank you'\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for 'thank you'.\"}\n",
      "Requested changes: \n",
      "Query: Can you tell me how to say 'goodbye' in Chinese?\n",
      "Request: {\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say 'goodbye' in Chinese?\"}\n",
      "Requested changes: \n",
      "Query: I'm trying to learn the Arabic word for 'please'.\n",
      "Request: {\"task_description\": \"Learn the Arabic word for 'please'\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I'm trying to learn the Arabic word for 'please'.\"}\n",
      "Requested changes: \n"
     ]
    }
   ],
   "source": [
    "ground_truth = []\n",
    "for query, request_arg in list(zip(queries, request_args)):\n",
    "    feedback = input(f\"Query: {query}\\nRequest: {request_arg}\\nRequested changes: \")\n",
    "    if feedback == 'n' or feedback == 'none' or not feedback:\n",
    "        ground_truth.append(request_arg)\n",
    "        continue\n",
    "    resolved = correction_chain.run(request=request_arg,\n",
    "                            user_feedback=feedback)\n",
    "    ground_truth.append(resolved.strip())\n",
    "    print(\"Updated request:\", resolved)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19d68882",
   "metadata": {},
   "source": [
    "**Now you can use the `ground_truth` as shown above in [Evaluate the Requests Chain](#Evaluate-the-requests-chain)!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5a596176",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['{\"task_description\": \"say \\'hello\\'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say \\'hello\\' in Spanish?\"}',\n",
       " '{\"task_description\": \"understanding the French word for \\'goodbye\\'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for \\'goodbye\\'.\"}',\n",
       " '{\"task_description\": \"say \\'thank you\\'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'thank you\\' in German?\"}',\n",
       " '{\"task_description\": \"Learn the Italian word for \\'please\\'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Italian word for \\'please\\'.\"}',\n",
       " '{\"task_description\": \"Help with pronunciation of \\'yes\\' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of \\'yes\\' in Portuguese?\"}',\n",
       " '{\"task_description\": \"Find the Dutch word for \\'no\\'\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I\\'m looking for the Dutch word for \\'no\\'.\"}',\n",
       " '{\"task_description\": \"Explain the meaning of \\'hello\\' in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you explain the meaning of \\'hello\\' in Japanese?\"}',\n",
       " '{\"task_description\": \"understanding the Russian word for \\'thank you\\'\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for \\'thank you\\'.\"}',\n",
       " '{\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'goodbye\\' in Chinese?\"}',\n",
       " '{\"task_description\": \"Learn the Arabic word for \\'please\\'\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Arabic word for \\'please\\'.\"}']"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now you have a new ground truth set to use as shown above!\n",
    "ground_truth"
   ]
  },
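  {
   "cell_type": "markdown",
   "id": "8d4b6f02",
   "metadata": {},
   "source": [
    "Below is a commented-out sketch of that re-scoring step: grade the predicted `request_args` against the new `ground_truth` using the same request-grading prompt from [Evaluate the requests chain](#Evaluate-the-requests-chain). `request_eval_prompt` is a placeholder name for that `PromptTemplate`; rebuild or rename it to match your session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b9e5a64",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch (assumes `request_eval_prompt` is the request-grading PromptTemplate from above)\n",
    "# request_grader = LLMChain(llm=llm, prompt=request_eval_prompt, verbose=verbose)\n",
    "# new_request_eval_results = [\n",
    "#     request_grader.run(question=q, truth_query=t, predict_query=p)\n",
    "#     for q, t, p in zip(queries, ground_truth, request_args)\n",
    "# ]\n",
    "# scores['request_synthesizer'].extend(parse_eval_results(new_request_eval_results))"
   ]
  },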
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7fe9dfa",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}