History and Application of LLM Leveraging Big Data
dalgual
35 slides
Jul 16, 2024
About This Presentation
Myself
Introduction To Big Data
Scalable Data Intensive Computing
Applications with ML & DL
Summary
Slide Content
Application of LLM Leveraging Big Data. Jongwook Woo, PhD, Big Data AI Center (BigDAI), California State University Los Angeles. July 16, 2024, YISS, Yonsei University, Korea.
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Myself: Experience. Since 2002, Professor at California State University Los Angeles; Director of BigDAI (Big Data AI Center). PhD/MS in 2001/1998: Computer Science and Engineering, USC. BS/MS in 1989/1991: Electronic Engineering, Yonsei University.
Myself: S/W Development Lead http://www.mobygames.com/game/windows/matrix-online/credits
Myself: Collaborations SOFT ZEN
Collaboration with NVIDIA, Databricks, Oracle, Amazon, CDH, and Yonsei using Big Data AI: https://www.cloudera.com/more/customers/csula.html
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Data Issues. Large-scale data generated hourly, daily, …: terabytes (10^12 bytes), petabytes (10^15 bytes). Because of … IoT (streaming data, sensor data) in SmartX, social computing, smartphones, online games, web, bioinformatics, … Legacy approach: too expensive to store and process large-scale data.
Data Handling: One Approach. Build a new, more powerful system that is bigger and more expensive.
Data Handling: One Approach. IBM z15 mainframe: the T02 starts at $160,000. https://techcrunch.com/2019/09/12/the-mainframe-business-is-alive-and-well-as-ibm-announces-new-z15/
Data Handling: Another Way. Less expensive and more scalable. From the 2017 Korean blockbuster movie "The Fortress" (남한산성). AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea), by the Choi family: Choi Hae-san (최해산, 崔海山), whose father was Choi Mu-seon (최무선, 崔茂宣). [Ref] "Joseon's Secret Weapon: the Chongtonggi Hwacha" (조선의 비밀 병기: 총통기 화차, 銃筒機火車), by 도심. Use what already exists without additional expense.
H/W: Leveraging Big Data Cluster with GPU
Data Handling: Another Way. Less expensive and more scalable. LinkedIn Hadoop Spark cluster: 10,000 nodes with 500 PB of capacity through 2020. https://www.nextplatform.com/2021/09/15/the-endless-pursuit-of-scale-at-linkedin/ https://engineering.linkedin.com/blog/2021/scaling-linkedin-s-hadoop-yarn-cluster-beyond-10-000-nodes
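To put the LinkedIn figures in perspective, the short sketch below divides the quoted capacity by the quoted node count. It assumes decimal units (1 PB = 10^15 bytes) and capacity spread evenly across nodes, and it ignores replication; none of these details are stated on the slide, and the function name is only illustrative.

```python
PB = 10**15  # bytes in a petabyte (decimal units, an assumption)
TB = 10**12  # bytes in a terabyte (decimal units, an assumption)


def per_node_capacity_tb(total_pb, nodes):
    """Average capacity per node in TB, assuming an even spread across nodes."""
    return total_pb * PB / nodes / TB


if __name__ == "__main__":
    # Figures quoted on the slide: 500 PB across 10,000 nodes.
    print(per_node_capacity_tb(500, 10_000), "TB per node")
```

Under these assumptions each node holds roughly 50 TB, which is commodity-server territory rather than mainframe territory.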
Big Data in Systems. Big Data: Definition Again. An inexpensive platform of distributed parallel computing systems that can store large-scale data and process it in parallel.
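The "store in partitions, process in parallel" idea in this definition can be sketched on a single machine. The example below is only an illustration, not how Hadoop or Spark are implemented: a thread pool stands in for cluster worker nodes, each in-memory list stands in for a data partition stored on one node, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: count the words in one partition (the work of one node)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(partitions, workers=4):
    """Process all partitions concurrently, then merge (reduce) the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, partitions):
            total.update(partial)
    return total


if __name__ == "__main__":
    partitions = [
        ["big data platform", "store large scale data"],
        ["process data in parallel"],
    ]
    print(parallel_word_count(partitions).most_common(3))
```

Adding capacity in this model means adding more partitions and more workers, which is the scale-out alternative to buying one larger machine.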
Interviewed by Cloudera, Nov 12, 2014: https://www.youtube.com/watch?v=ZvrHxsypeUE
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Jams and other traffic incidents reported by users in Dec 2017 – Jan 2018: (Dalyapraz Dauletbak)
Talk on COVID-19 at Arirang TV, 08/27/2020. Invited talk about COVID-19 and the post-pandemic era using Big Data AI, at Arirang TV in Korea.
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Big Data Prediction. Big Data Science: how to predict future trends and patterns from massive datasets? Deep Learning, Machine Learning, AI.
Deep Learning Example with Images. GAN; Neural Style Transfer with "A Neural Algorithm of Artistic Style" (Gatys et al.). The Bathers, Yoon-Bok Shin, Korea (1758 - ?). Two Young Girls at the Piano, Auguste Renoir, French, 1892.
Deep Learning Example (Cont’d) GAN
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Study ML/DL in Big Data AI. Text processing and classification of business reviews data: Amazon Products Ratings/Review, OpenTable Helpfulness, Transaction Fraud Detection, Traffics. Product recommendation: Airbnb Listings, Amazon Product. Price prediction: Used Cars, Flight, Liquor Sales. Community service: COVID-19 (Confirmed Cases, Vaccine effects), History books/data.
Example: Text Data of History Books. 『明史』 卷41 志17 地理2 山東 遼東都指揮使司 鐵嶺衞: establishing the Tieling Guard (철령위) at Cheollyeong Fortress, on the border with Goryeo: https://bit.ly/3VefgLl Copy and paste the contents into ChatGPT or Copilot. For example: 다음의 한자를 한국말로 번역해줘 ("Translate the following Chinese characters into Korean"): "朝鮮, 箕子所封國也. 漢以前曰朝鮮. 始爲燕人衞滿所據, 漢武帝平之, 置眞番·臨屯·樂浪·玄菟四郡. 漢末, 有扶餘人高氏據其地, 改國號曰高麗, 又曰高句麗, 居平壤, 即樂浪也. 已, 爲唐所破, 東徙. 後唐時, 王建代高氏, 兼併新羅·百濟地, 徙居松岳, 曰東京, 而以平壤爲西京. 其國北鄰契丹, 西則女直, 南曰日本. 元至元中, 西京內屬, 置東寧路總管府, 盡慈嶺爲界."
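The copy-and-paste step can be made repeatable with a small helper that assembles the same prompt for any passage. This is a minimal sketch: the function name is hypothetical, the default instruction is the Korean prompt quoted on the slide, and no LLM API is called because the slide only describes pasting text into a chat window.

```python
# Instruction quoted on the slide: "Translate the following Chinese characters into Korean".
DEFAULT_INSTRUCTION = "다음의 한자를 한국말로 번역해줘"


def build_translation_prompt(passage, instruction=DEFAULT_INSTRUCTION):
    """Return the instruction followed by the quoted passage.

    Runs of whitespace (including line breaks left by copy and paste)
    are collapsed to single spaces so the passage stays on one line.
    """
    cleaned = " ".join(passage.split())
    return f'{instruction}: "{cleaned}"'


if __name__ == "__main__":
    passage = "元至元中, 西京內屬,\n 置東寧路總管府, 盡慈嶺爲界."
    print(build_translation_prompt(passage))
```

The resulting string can be pasted into ChatGPT or Copilot exactly as the slide describes.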
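Example: Text Data of History Books (Cont’d). 『明史』 卷41 志17 地理2 山東 遼東都指揮使司 鐵嶺 says: "…. 西京內屬, 置東寧路總管府, 盡慈嶺爲界." ChatGPT translates it as: "… 왕건이 고씨를 대신하여 신라와 백제의 땅을 흡수하고 성곡에 이주하여 동경(東京)이라 하였고, 평양을 서경으로 하였습니다. 그 나라는 북쪽으로는 거란, 서쪽으로는 녀직, 남쪽으로는 일본과 접합니다. 원나라 시대 중에는 서경이 내려앉아 동녕로 총관부를 설치하여 자치선으로 하였고, 전체적으로는 자평령을 경계로 하였습니다." (Roughly, in English, rendering ChatGPT's output as given: "… Wang Geon replaced the Go clan, absorbed the lands of Silla and Baekje, moved to Seonggok and called it Donggyeong (東京), and made Pyongyang the Western Capital. The country borders the Khitan to the north, the Jurchen to the west, and Japan to the south. During the Yuan dynasty, the Western Capital sank, the Dongnyeong Circuit General Administration was established as an autonomous line, and overall Japyeong Pass was made the boundary.")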
Example (Cont’d) 1. Open Baidu and search for "東寧路".
Example (Cont’d) 2. Copy the address of "東寧路" shown in Baidu: 东宁路
Example (Cont’d) 3. Paste the address, 东宁路 吉林省辽源市东丰县, into Google Maps.
Example (Cont’d) 4. Compare what we have now with what we found on Google Maps.
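Step 3 can also be done with a link instead of pasting by hand. The sketch below builds a Google Maps search link for the address found on Baidu, using the public Maps URL format (`api=1` plus a `query` parameter). The helper name is hypothetical, and the Baidu lookup itself is still assumed to be done manually as on the slides.

```python
from urllib.parse import unquote_plus, urlencode

MAPS_SEARCH_BASE = "https://www.google.com/maps/search/"


def google_maps_search_url(address):
    """Build a Google Maps search link; urlencode percent-encodes non-ASCII text."""
    return MAPS_SEARCH_BASE + "?" + urlencode({"api": 1, "query": address})


if __name__ == "__main__":
    # Address copied from Baidu in step 2 of the slides.
    url = google_maps_search_url("东宁路 吉林省辽源市东丰县")
    print(url)
    # Decoding the query confirms the address survives the round trip.
    print(unquote_plus(url.split("query=")[1]))
```

Opening the printed link in a browser shows the location to compare against the historical text in step 4.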
NLP Rating Classifying Models in OpenTable: Performance Comparison of the Models
LR has the shortest computation time.
Precision and AUC of DistilBERT are 71.2% and 70.5%.
4 - 14% better Precision and AUC than traditional models.
Slightly better Precision and Accuracy than BERT: 0.7 and 1%, respectively.

Algorithm    Precision  AUC    Computing Time log (sec)
LR           0.682      0.644  201
RF           0.667      0.524  634
GBT          0.637      0.616  3,141
BERT         0.707      0.694  11,936
DistilBERT   0.712      0.705  7,048

"Comparing NLP Models with LLM Classifying OpenTable Dataset", H. Lin, S. Lee, J. Park, E. Lim, J. Woo, KrAIS 2024
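The percentage comparisons can be checked directly against the table. The sketch below computes the relative gain of DistilBERT over each other model from the Precision and AUC columns. The slide does not say whether its percentages are relative gains or percentage-point differences, so treating them as relative gains is an assumption, and the function names are illustrative.

```python
# (precision, AUC) pairs copied from the table on the slide.
SCORES = {
    "LR": (0.682, 0.644),
    "RF": (0.667, 0.524),
    "GBT": (0.637, 0.616),
    "BERT": (0.707, 0.694),
    "DistilBERT": (0.712, 0.705),
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def gains_over(baseline, target="DistilBERT"):
    """Return (precision gain %, AUC gain %) of `target` over `baseline`."""
    target_precision, target_auc = SCORES[target]
    base_precision, base_auc = SCORES[baseline]
    return (
        relative_gain(target_precision, base_precision),
        relative_gain(target_auc, base_auc),
    )


if __name__ == "__main__":
    for name in SCORES:
        if name != "DistilBERT":
            precision_gain, auc_gain = gains_over(name)
            print(f"{name}: precision +{precision_gain:.1f}%, AUC +{auc_gain:.1f}%")
```

Running it prints one line per baseline, which makes it easy to see which of the slide's quoted ranges each pair of models falls into.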
Contents Myself Introduction To Big Data Scalable Data Intensive Computing Applications with ML & DL Summary
Summary. Big Data platform for large-scale data. LLMs (ChatGPT, Gemini, Bing): your secretary, translator, advisor.