# System architectures with adaptive accelerators for genomics

Sang-Woo Jun

Assistant Professor, Department of Computer Science University of California, Irvine








## Working to Bridge the Silos...



[3] Database Architects, "The Great CPU Stagnation", 2023

[4] Marvell 2020 Investor Day, Slide 43








## **Specialization for Performance & Efficiency**














**NVIDIA**, with no solid competition, is out here competing against **Moore's Law** instead.


- Volta V100 (2017), 12 nm, ~\$10,000 at release
  - 32-bit CUDA: ~14 TFLOPS
  - 32-bit tensor: ~112 TFLOPS
  - 21B transistors, ~300 W, 815 mm<sup>2</sup>
- Ampere A100 (2020), 7 nm, ~\$10,000 at release
  - 32-bit CUDA: ~19.5 TFLOPS
  - 32-bit tensor: ~156 TFLOPS (\*TF32 != FP32!)
  - 52B transistors, ~300 W, 826 mm<sup>2</sup>
- Hopper H100 (2022), 4 nm, ~\$25,000 at release
  - 32-bit CUDA: ~67 TFLOPS
  - 32-bit tensor: ~400 TFLOPS, higher with sparsity support (\*TF32 != FP32!)
  - 80B transistors, ~300 W, 814 mm<sup>2</sup>
- Blackwell B100 (2024), 4 nm, ~\$35,000 at release
  - 32-bit CUDA: ~60 TFLOPS
  - 32-bit tensor: ~900 TFLOPS, higher with sparsity support (\*TF32 != FP32!)
  - 208B transistors, ?? mm<sup>2</sup>
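As a rough check on the specialization argument, the figures in the list above can be compared across generations. The sketch below uses only the approximate numbers from the slide; the tuple layout and the `growth` helper are illustrative choices, not part of the talk.

```python
# Approximate figures copied from the list above.
# Columns: name, year, price (USD), 32-bit CUDA TFLOPS, 32-bit tensor TFLOPS,
# transistors (billions).
gpus = [
    ("V100", 2017, 10_000, 14.0, 112.0, 21),
    ("A100", 2020, 10_000, 19.5, 156.0, 52),
    ("H100", 2022, 25_000, 67.0, 400.0, 80),
    ("B100", 2024, 35_000, 60.0, 900.0, 208),
]


def growth(first, last, column):
    """Ratio of one column between two generations."""
    return last[column] / first[column]


cuda_growth = growth(gpus[0], gpus[-1], 3)        # general-purpose cores
tensor_growth = growth(gpus[0], gpus[-1], 4)      # specialized tensor cores
transistor_growth = growth(gpus[0], gpus[-1], 5)  # silicon budget

print(f"CUDA x{cuda_growth:.2f}, tensor x{tensor_growth:.2f}, "
      f"transistors x{transistor_growth:.2f}")
```

With these numbers, tensor throughput grows roughly twice as fast as general-purpose CUDA throughput between V100 and B100, which is the gap the slide attributes to specialization.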


What about the rest of us?







- Irregular computation patterns
- Irregular memory accesses
- Graphs larger than GPU memory
- Low warp utilization















Isn't GPU throughput supposed to be multiple TFLOPS?

- Irregular memory accesses
- Graphs larger than GPU memory




#### An Example: Graph Neural Networks!



Can FPGAs save us?

Not by itself!









![](_page_42_Figure_1.jpeg)

#### **Repeated discovery:**

Algorithms and system architecture must be co-optimized with hardware acceleration!

![](_page_42_Figure_4.jpeg)

Not today's topic...

#### But, We Need More Performance!

![](_page_43_Figure_1.jpeg)


![](_page_43_Figure_3.jpeg)

https://epoch.ai/blog/machine-learning-model-sizes-and-the-parameter-gap


![](_page_46_Picture_1.jpeg)

**Cancer Patient** 

![](_page_47_Figure_1.jpeg)

![](_page_48_Picture_1.jpeg)

![](_page_49_Picture_1.jpeg)

![](_page_50_Picture_1.jpeg)

#### **Genome Assembly Methods**

Long read samples

![](_page_52_Figure_0.jpeg)

![](_page_52_Figure_1.jpeg)

![](_page_52_Figure_2.jpeg)

![](_page_53_Figure_0.jpeg)

![](_page_53_Figure_1.jpeg)

**De-Novo Assembly** 

![](_page_54_Figure_0.jpeg)

![](_page_54_Figure_1.jpeg)

**De-Novo Assembly** 

[1] Chaisson, Mark JP, Richard K. Wilson, and Evan E. Eichler. "Genetic variation and the de novo assembly of human genomes." Nature Reviews Genetics 16.11 (2015): 627-640.
[2] Ashley, Euan A. "Towards precision medicine." Nature Reviews Genetics 17.9 (2016): 507-522.
[3] Meyn, Stephen. "A critical tool for human genomics and precision medicine: De novo human genome assembly." University of Wisconsin–Madison Research Blog

![](_page_55_Figure_0.jpeg)

![](_page_55_Figure_1.jpeg)


## Genome Assembly Methods

![](_page_56_Figure_1.jpeg)


### De Novo Assembly for Personalized Medicine

"However, *de novo* assembly, particularly of short reads, is computationally intense and impractical for clinical genome sequencing" <sup>[2]</sup>

"We have been running a single NextDenovo instance for 1 year on a 1 TB AWS instance. We hope it will finish soon" -- One of our research collaborators

Hurrah! A systems research problem!

| Stage      | Step                  | Mem (GB) | Time (s) | Acceleration target |
|------------|-----------------------|----------|----------|---------------------|
| Correction | Raw_align (minimap2)  | 9        | 2,203    | Yes                 |
| Correction | Sort                  | < 9      | 176      |                     |
| Correction | Next_correct          | < 9      | 1,851    | Yes                 |
| Alignment  | Cns_align (minimap2)  | 8        | 3,907    | Yes                 |
| Alignment  | Ctg_graph             | < 8      | 7.8      |                     |
| Alignment  | Ctg_align (minimap2)  | 12       | 390      |                     |

![](_page_63_Figure_1.jpeg)

![](_page_64_Figure_1.jpeg)

![](_page_65_Figure_1.jpeg)

#### Big source of scalability concerns: handling graphs

- Overlap graphs, De Bruijn graphs, string graphs, ...
- Quite large!
  - 500+ GB for human
  - TBs for some plants (pine, onion, ...)
- Vertices are small (a few bytes)
- Construct, then traverse

Irregular computation patterns

Large memory requirements

Not readily parallelizable

![](_page_70_Figure_11.jpeg)

![](_page_70_Figure_12.jpeg)

Kalyanaraman, A. (2011). Genome Assembly. In: Padua, D. (eds) Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4\_402
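To make "construct, then traverse" concrete, here is a minimal sketch of a De Bruijn graph built from a few toy reads and then walked along its unambiguous path. The reads, `k`, and both function names are made up for illustration; real assemblers add error handling, reverse complements, and far more compact storage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Construct: each k-mer adds an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Traverse: extend a contig while the next node has one way in and one way out."""
    indegree = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indegree[dst] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break  # branch or cycle: stop extending this contig
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy overlapping reads
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

Even in this toy, each vertex is only a few bytes and every traversal step is a dependent lookup, which is where the irregular access pattern comes from.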

#### High-Performance Graph Analytics in SSDs

- Slowing DRAM density scaling
  - Graphs scaling faster than memory can!
- SSDs are cheaper... Can we use those instead?

![](_page_71_Figure_3.jpeg)

![](_page_71_Figure_4.jpeg)



Unfortunately, they are also slow...


































Software must issue many non-blocking access requests!



- Targeting near-storage acceleration (e.g., SmartSSD)
- Key idea: asynchronous query with callback
  - Out-of-order, latency-insensitive
  - Programmer-specified callback function is called when data is ready

```
foreach vertex.getNeighbors(callback=myCallback)

function myCallback(src, dst[]) begin
    ... application-specific logic ...
end
```

- Many queries can be in flight at once (>millions)
- Storage access latency can be hidden
- Accesses to the same page are grouped transparently, minimizing I/O amplification!
- Other optimizations can be hidden behind the same interface
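The callback interface can be sketched in plain Python. This is only a toy simulation under stated assumptions: `AsyncGraphStore`, `PAGE_SIZE`, and `drain` are hypothetical names, the "storage" is an in-memory dictionary, and nothing here reflects the actual SmartSSD implementation.

```python
from collections import defaultdict

PAGE_SIZE = 4  # vertices per simulated storage page (hypothetical value)


class AsyncGraphStore:
    """Toy stand-in for a near-storage engine; not the real system's API."""

    def __init__(self, adjacency):
        self.adjacency = adjacency  # vertex id -> list of neighbor ids
        self.pending = []           # (vertex, callback) requests in flight
        self.page_reads = 0         # simulated storage reads issued

    def get_neighbors(self, vertex, callback):
        # Non-blocking: record the request and return immediately.
        self.pending.append((vertex, callback))

    def drain(self):
        # Group pending requests by page so each page is read only once.
        by_page = defaultdict(list)
        for vertex, callback in self.pending:
            by_page[vertex // PAGE_SIZE].append((vertex, callback))
        self.pending = []
        for page in sorted(by_page):
            self.page_reads += 1  # one simulated read serves the whole group
            for vertex, callback in by_page[page]:
                callback(vertex, self.adjacency.get(vertex, []))


adjacency = {0: [1, 5], 1: [0], 5: [0, 6], 6: [5]}
store = AsyncGraphStore(adjacency)
degrees = {}


def my_callback(src, dst):
    degrees[src] = len(dst)  # application-specific logic


for v in [5, 0, 6, 1]:  # issue order differs from completion order
    store.get_neighbors(v, callback=my_callback)
store.drain()
print(degrees, store.page_reads)
```

Four requests complete with two simulated page reads, and the callbacks fire in page order rather than issue order, which is acceptable because the callback does not depend on ordering.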

## A Library of Optimizations to Hide

- Access re-organization (Done)
  - Burst-sorting accelerator to group accesses to the same page
- Probabilistic filtering (Done)
  - Use a Bloom filter to avoid storage reads that would return negative results
  - e.g., nonexistent graph edges, nodes with no outgoing edges
- Compression (In Progress)
  - Application-specific compression, e.g., LZ4, ZFP, XOR, VarInt
  - Reference-based compression
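The probabilistic-filtering item can be illustrated with a small software Bloom filter. This is a sketch under assumptions: the class, its sizing (`num_bits=1024`, `num_hashes=3`), the SHA-256-based hashing, and the `edge_exists` helper are all illustrative and say nothing about how the hardware filter is built.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


edges = {(0, 1), (1, 2), (2, 0)}  # toy edge list standing in for storage
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for edge in edges:
    bloom.add(edge)


def edge_exists(edge, storage):
    if not bloom.might_contain(edge):
        return False        # definite miss: no storage read is issued
    return edge in storage  # possible hit: confirm with the real read


print(edge_exists((0, 1), edges), edge_exists((5, 9), edges))
```

A negative answer from the filter is always correct, so every such query saves a storage read; a positive answer still has to be confirmed against storage.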

## Preliminary Evaluation: Triangle Counting

- Counts the number of triangles in a graph
- Important application
  - One of four benchmarks in MIT/Lincoln Labs GraphChallenge<sup>[9]</sup>
- Involves two neighborhood queries
  - For each V, enumerate pairs (A, B) from neighbor(V) and check whether B ∈ neighbor(A)
  - Bloom filter built from graph edges: skip the neighborhood query for A if edge(A, B) doesn't exist
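The two-query structure can be sketched as follows. Assumptions are flagged in the comments: a plain Python set stands in for the Bloom filter (so there are no false positives here), the graph is a four-vertex toy, and vertices are visited in increasing order so each triangle is counted once.

```python
from itertools import combinations


def count_triangles(adjacency, edge_filter):
    """Return (triangles, second_queries_issued, second_queries_skipped)."""
    triangles = issued = skipped = 0
    for v, neighbors in adjacency.items():       # first query: neighbor(V)
        higher = sorted(n for n in neighbors if n > v)
        for a, b in combinations(higher, 2):     # pairs (A, B) with V < A < B
            if (a, b) not in edge_filter:
                skipped += 1                     # filter says no edge: skip neighbor(A)
                continue
            issued += 1                          # second query: neighbor(A)
            if b in adjacency[a]:
                triangles += 1
    return triangles, issued, skipped


# Toy undirected graph with edges 0-1, 0-2, 0-3, 1-2, 2-3.
adjacency = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
# Stand-in for the Bloom filter: the exact set of edges, stored as (low, high).
edge_filter = {(u, v) for u, nbrs in adjacency.items() for v in nbrs if u < v}

result = count_triangles(adjacency, edge_filter)
print(result)
```

Here the pair (1, 3) is rejected by the filter, so one of the three candidate second-level queries never reaches storage.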



#### **Experimental Setup**

- State-of-the-art baselines:
  - GraphBLAS
  - HPEC Graph Challenge champions: Karypis (CPU), TRUST (GPU)
  - Many others that failed due to memory limitations (e.g., Neo4J)
- Dell T640 server with 24-core Xeon Gold, 200 GB DRAM, and V100 GPU
  - Plus <u>one</u> Samsung SmartSSD for SSD+FPGA
  - Our approach used only 4 threads and 4 GB of memory

| Graph    | Edges (billions) |
|----------|------------------|
| DARPA    | 0.44             |
| V1r      | 0.46             |
| MAWI     | 0.48             |
| Graph500 | 1.05             |
| Twitter  | 1.46             |




¼ the cost, comparable performance





## What to accelerate, for De Novo Assembly?

#### Many De Novo tools internally use "Minimap2"

- Input: reference, reads
- Output: mapping between them
- De novo assembly does not use a reference; the reads also act as the reference
  - Massively increased work: 10x or more!
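A simplified sketch of why the work grows when reads act as their own reference: every read is both indexed and queried. Note the assumptions: this toy indexes every k-mer in a Python dictionary, whereas minimap2 itself indexes minimizers, and the function names and reads are invented for illustration.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Hash construction: k-mer -> list of (sequence id, offset)."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_candidates(index, query, k, skip_id=None):
    """Hash lookup: count shared k-mers per target sequence."""
    hits = defaultdict(int)
    for offset in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[offset:offset + k], ()):
            if seq_id != skip_id:
                hits[seq_id] += 1
    return dict(hits)


reads = ["ATGGCGTG", "GCGTGCAA", "TTTTCCCC"]
index = build_index(reads, k=4)  # the reads are the "reference"
overlaps = {i: find_candidates(index, read, 4, skip_id=i)
            for i, read in enumerate(reads)}  # and also the queries
print(overlaps)
```

Both the insertions into `index` and the lookups in `find_candidates` touch effectively random hash buckets, which is the random-access behavior that makes this stage hard to accelerate.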



























- Random-access during hash construction
- Random-access during hash lookup

- Backtracking needs fast clock (CPU?)
- Score matrix is too large... (PCIe bottleneck!)

- Score matrix computation: Parallel
- Score matrix: Large
- Backtracking: Sequential
- Solution 1: Compress the matrix
- Solution 2: Parallel backtracking
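The contrast between a parallel score-matrix fill and sequential backtracking shows up even in a textbook global alignment. The sketch below is a plain Needleman-Wunsch implementation with assumed scoring parameters (`match=1`, `mismatch=-1`, `gap=-1`); it is not minimap2's alignment kernel and implements neither of the two solutions listed above.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: cell (i, j) depends only on its upper-left, upper, and left
    # neighbors, so all cells on one anti-diagonal are independent (parallel).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Backtrack: each step depends on the previous one (sequential), and it
    # needs the full rows x cols matrix to still be available.
    i, j, out_a, out_b = len(a), len(b), [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


result = align("ACGT", "AGT")
print(result)
```

The matrix holds (len(a) + 1) × (len(b) + 1) entries, so for long reads it is the matrix, not the arithmetic, that has to cross the accelerator boundary before backtracking can start.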





#### Long-Term Goal: Precision ("Personalized") Medicine



#### Our Efforts So Far...

| Venue          | Platform    | Topic                           |
|----------------|-------------|---------------------------------|
| ISCA 2018      | NVM + FPGA  | Vertex-centric graph analytics  |
| Frontiers 2021 | NVM         | Genomic graphs (SMuFin)         |
| FPL 2022       | DRAM + FPGA | Genomic graphs (De Bruijn)      |
| PACT 2023      | NVM + FPGA  | Graph Neural Networks           |
| DAC 2024       | NVM + FPGA  | Software-driven graph analytics |

- Genome Compression
- Graph Compression
- Parallel Backtracking

## ARDA is Interested in a LOT of things!

- Graph Neural Networks
- Edge processing: Earthquakes and Wildfires
- Edge processing: Smart Agriculture
- Processing-In-Memory
- Accelerating Program Analysis
- Scientific Computing: Symbolic Regression

#### Oh my!

#### **Students Involved**



- PhD Se-Min Lim @ UCI
  - Scalable Graph Neural Networks with near-storage acceleration



- PhD Seongyoung Kang @ UCI
  - Scalable Subgraph Isomorphism with near-storage acceleration
  - Triangle counting demo being developed
    - Plan to present to Samsung collaborators (Xuebin Yao, Reza Soltaniyeh)



- PhD Esmerald Aliaj @ UCI
  - Compiler support for hardware kernel generation