LeetCode 코딩 테스트 - DNA Pattern Recognition(LV.Medium)

[SQL]

LeetCode 코딩 테스트 - DNA Pattern Recognition(LV.Medium)

indongspace 2025. 4. 1. 21:11

Table: Samples

+----------------+---------+
| Column Name    | Type    | 
+----------------+---------+
| sample_id      | int     |
| dna_sequence   | varchar |
| species        | varchar |
+----------------+---------+
sample_id is the unique key for this table.
Each row contains a DNA sequence represented as a string of characters (A, T, G, C) and the species it was collected from.

Biologists are studying basic patterns in DNA sequences. Write a solution to identify sample_id with the following patterns:

Sequences that start with ATG (a common start codon)
Sequences that end with either TAA, TAG, or TGA (stop codons)
Sequences containing the motif ATAT (a simple repeated pattern)
Sequences that have at least 3 consecutive G (like GGG or GGGG)

Return the result table ordered by sample_id in ascending order.

The result format is in the following example.

Example:

Input:

Samples table:

+-----------+------------------+-----------+
| sample_id | dna_sequence     | species   |
+-----------+------------------+-----------+
| 1         | ATGCTAGCTAGCTAA  | Human     |
| 2         | GGGTCAATCATC     | Human     |
| 3         | ATATATCGTAGCTA   | Human     |
| 4         | ATGGGGTCATCATAA  | Mouse     |
| 5         | TCAGTCAGTCAG     | Mouse     |
| 6         | ATATCGCGCTAG     | Zebrafish |
| 7         | CGTATGCGTCGTA    | Zebrafish |
+-----------+------------------+-----------+

Output:

+-----------+------------------+-------------+-------------+------------+------------+------------+
| sample_id | dna_sequence     | species     | has_start   | has_stop   | has_atat   | has_ggg    |
+-----------+------------------+-------------+-------------+------------+------------+------------+
| 1         | ATGCTAGCTAGCTAA  | Human       | 1           | 1          | 0          | 0          |
| 2         | GGGTCAATCATC     | Human       | 0           | 0          | 0          | 1          |
| 3         | ATATATCGTAGCTA   | Human       | 0           | 0          | 1          | 0          |
| 4         | ATGGGGTCATCATAA  | Mouse       | 1           | 1          | 0          | 1          |
| 5         | TCAGTCAGTCAG     | Mouse       | 0           | 0          | 0          | 0          |
| 6         | ATATCGCGCTAG     | Zebrafish   | 0           | 1          | 1          | 0          |
| 7         | CGTATGCGTCGTA    | Zebrafish   | 0           | 0          | 0          | 0          |
+-----------+------------------+-------------+-------------+------------+------------+------------+

Explanation:

Sample 1 (ATGCTAGCTAGCTAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 2 (GGGTCAATCATC):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Contains GGG (has_ggg = 1)
Sample 3 (ATATATCGTAGCTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Contains ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 4 (ATGGGGTCATCATAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Contains GGGG (has_ggg = 1)
Sample 5 (TCAGTCAGTCAG):
- Does not match any patterns (all fields = 0)
Sample 6 (ATATCGCGCTAG):
- Does not start with ATG (has_start = 0)
- Ends with TAG (has_stop = 1)
- Starts with ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 7 (CGTATGCGTCGTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, "TAG", or "TGA" (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)

Note:

The result is ordered by sample_id in ascending order
For each pattern, 1 indicates the pattern is present and 0 indicates it is not present

# 쿼리를 작성하는 목표, 확인할 지표 : dna_sequence의 글자로 인한 분류 / dna_sequence
# 쿼리 계산 방법 : 1. ATG로 시작하는지, TAA로 끝나는지, ATAT를 포함하는지, GGG를 포함하는지 CASE WHEN 으로 분류 -> 2. 정렬
# 데이터의 기간 : x
# 사용할 테이블 : samples
# JOIN KEY : x
# 데이터 특징 : x
SELECT
    sample_id,
    dna_sequence,
    species,
    # 1
    IF(dna_sequence LIKE('ATG%'), 1, 0) AS has_start,
    IF(dna_sequence LIKE('%TAA') OR dna_sequence LIKE('%TAG') OR dna_sequence LIKE('%TGA'), 1, 0) AS has_stop,
    IF(dna_sequence LIKE('%ATAT%'), 1, 0) AS has_atat,
    IF(dna_sequence REGEXP 'G{3,}', 1, 0) AS has_ggg
FROM samples
# 2
ORDER BY
    sample_id

# REGEXP : 정규 표현식(Regular Expression)을 사용하여 패턴 매칭을 수행하는 연산자
# G{3,} : G(문자), {3,}(최소 3번 이상 반복 = 3개 이상 연속)
# 다만 REGEXP는 대소문자를 구분하지 않음. (ggg도 인식)

'[SQL]' 카테고리의 다른 글

LeetCode 코딩 테스트 - Odd and Even Transactions(LV.Medium) (0)	2025.04.04
LeetCode 코딩 테스트 - Analyze Subscription Conversion(LV.Medium) (0)	2025.04.02
LeetCode 코딩 테스트 - Confirmation Rate(LV.Medium) (0)	2025.03.30
LeetCode 코딩 테스트 - Count Salary Categories(LV.Medium) (0)	2025.03.27
LeetCode 코딩 테스트 - Nth Highest Salary(LV.Medium) (0)	2025.03.25

현재글LeetCode 코딩 테스트 - DNA Pattern Recognition(LV.Medium)

인동머스크

" 우리에게는 존재하지 않는 것들을 꿈꿀 수 있는 사람들이 필요하다. " (ADsP / 빅데이터분석기사 / SQLD) https://github.com/Indongspace

hackerrank, 티스토리챌린지, 태블로, 메타코드, 카일스쿨, mysql, 오블완, 코딩테스트, ADsP, 복습, tableau, 데이터자격검정, 프로그래머스, 빅데이터분석기사, leetcode, 통계기초, 인프런, 시각화, vizlab, 빅분기,

Today :
Yesterday :

일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28

인동머스크