--
Azure Data Engineering Interview Questions
Data engineering is a critical role in the world of data science and analytics.
In this article, I will be sharing questions that I have faced during the interview process i.e., asked in product and service-based companies.
- Write an SQL query to find a day on which the weather temperature is higher than the previous day.
create table weather
(id int,
recorddate date,
temperature int);insert into weather values(1,’2015–01–01',10),(2,’2015–01–02',25),(3,’2015–01–03',20),(4,’2015–01–04',30);
select t.Id from Weather as t, Weather as y
where Datediff (t.RecordDate, y.RecordDate) =1
and t.Temperature > y.Temperature;
2. Write a SQL to get the sum of total purchases by category?
if you see the total purchases of fruits; which is the sum of banana and Apple i.e.10, similarly the total purchases of the four vegetables should be the sum of the vegetables i.e.44
So can you please write a query SQL query that will take this as input and you will have a new column that is total purchases?
create table item
(item varchar(20),
purchases int,
category varchar(20));insert into item values(‘banana’,2,’fruit’),(‘apple’,8,’fruit’),(‘leek’,2,’vegetable’),(‘cabbage’,9,’vegetable’),(‘lettuce’,10,’vegetable’),(‘kale’,23,’vegetable’);
select item, purchases, category, sum(purchases)over(partition by category)as total_purchases from item ;
3. Can you write a query to display the cumulative total purchases like a banana, apple is there, so for a banana it is 2, and for Apple 8+2 so the total is 10. Similarly for vegetables: leek is 2, for cabbage it is 9 so it will be 9+2 i.e. 11.. & so on. For lettuce- 10+9+2 = 21 so on.
select item, purchases, category, sum(purchases)over(partition by category order by category,purchases)
as total_purchases from item ;
4. Write logic to print the print reverse of a number without using inbuilt functions.
n = int(input(“enter the number”)) #1234
reverse_num = 0
while n != 0:
reverse_num = (reverse_num*10) + (n %10)
n=n // 10
print(reverse_num)
5. Write logic to print even numbers from the list in Python.
lst = [1,2,3,4,5,6,7,8,9,10]
lst_even=[]
for i in lst:
if i%2 ==0:
lst_even.append(i)
print(lst_even)
6. Which salesman has sold the highest number of units each month?
create table SALES (SALES_DATE date ,SALESMAN_IDvarchar(20),PRODUCT_ID varchar(20),UNITS_SOLD int);
INSERT INTO SALES(SALES_DATE,SALESMAN_ID,PRODUCT_ID,UNITS_SOLD)VALUES
(‘2021–03–02’,’1',’2',75),
(‘2021–03–01’,’1',’2',50),
(‘2021–03–01’,’2',’1',40),
(‘2021–03–01’,’2',’1',60),
(‘2021–03–01’,’2',’1',60),
(‘2021–02–01’,’2',’1',50),
(‘2021–02–01’,’1',’2',50),
(‘2021–02–01’,’2',’1',50),
(‘2021–01–02’,’1',’2',35),
(‘2021–01–01’,’1',’1',35),
(‘2021–01–01’,’1',’2',48),
(‘2021–01–01’,’2',’1',34);
Solution:
select salesman_id, month,total_units_sold from (
select *, row_number()over(partition by month order by total_units_sold desc ) as row_num from (
select SALESMAN_ID, extract(month from SALES_DATE) as month, sum(UNITS_SOLD) as total_units_sold
from SALES
group by SALESMAN_ID,extract(month from SALES_DATE)
order by salesman_id,month ) as a) as b
where row_num=1;
7. Write a query to list out the student's ID who has scored greater than 50 marks in all the subjects?
create table stu(st_id int,sub_id varchar(20),marks int);
insert into stu values(1,’s1',60),(1,’s2',90),(1,’s3',70),(2,’s1',80),(2,’s2',40),(2,’s3',94),(3,’s1',73),(3,’s2',84),(3,’s3',52);
Solution:
select * from (
SELECT st_id,
MAX(CASE WHEN sub_id = ‘s1’ THEN marks ELSE NULL END)AS s1,
MAX(CASE WHEN sub_id = ‘s2’ THEN marks ELSE NULL END) AS s2,
MAX(CASE WHEN sub_id = ‘s3’ THEN marks ELSE NULL END) AS s3
FROM
(select st_id,sub_id ,marks,ROW_NUMBER() OVER (PARTITION BY st_id order by st_id ) as row_id from stu ) DS
GROUP BY st_id ORDER BY st_id) as a where s1>50 and s2>50 and s3>50;
8. Left Join and Full outer join for TableA and TableB.
Table A:- a, a, a, a
Table B:- a, a, a, a
Solution:
create table TableA( aid varchar(20));
create table TableB( bid varchar(20));insert into TableA(aid) values(‘a’),(‘a’),(‘a’),(‘a’);
insert into TableB(bid) values(‘a’),(‘a’),(‘a’),(‘a’);
9. Write a logic in Pyspark to find the max(salary) of an employee based on department vise.
Solution:
emp =[(1,’IT’,2000),(2,’IT’,3000),(3,’CS’,5000),(4,’CS’,6000),(5,’EC’,7000)]
emp_schema =[“emp_id”,”dept_name”,”salary”]
dept = [(101,’IT’,),(102,’CS’),(103,’EC’)]
dept_schema =[“dept_id”,”dept_name”]
df_emp = spark.createDataFrame(data=emp, schema=emp_schema)
df_dept = spark.createDataFrame(data=dept, schema=dept_schema)
df_join = df_emp.join(df_dept,df_emp.dept_name == df_dept.dept_name, “inner”)
df_join.show()
df= df_join.orderBy(col(“dept_id”).asc()).groupBy(“dept_id”).agg(max(“salary”).alias(“max_salary_by_dept”))
df.show()
10.
Input df
df1=id,date,s1,s2,s3. df2 = id,date,s1,s2,s3,s4
Output df
df3 = id,date,s1,s2,s3,s4
Example:
d1 = [[1,’A1'],[2,’A2'],[3,’A3']]
s1 = [“id”,” name”]
d2 = [[4],[5],[6]]
s2 = [“id”]
df1 = spark.createDataFrame(data=d1,schema=s1)
df2 = spark.createDataFrame(data=d2,schema=s2)
df1.unionByName(df2,allowMissingColumns = True).show()
+---+----+
| id|name|
+---+----+
| 1| A1|
| 2| A2|
| 3| A3|
| 4|null|
| 5|null|
| 6|null|
+---+----+
11. You need to find pairs of adults and child to go for a ride. The pair should be in a way that the oldest adult should be with the youngest child and so on...Get the scripts from the description box of the video.
Solution :
with cte_desc as(
select *,row_number()over(order by age desc) as rn_desc from Person1
where type =’Adult’
),
cte_asc as(
select *,row_number()over(order by age asc) as rn_asc from Person1
where type =’Child’
)
select a.person as ADULT,b.person as CHILD from cte_desc a
left join cte_asc b on a.rn_desc=b.rn_asc;
12. Write an SQL query to take the I/P and print the O/P as mentioned below.
Solution:
with cte as ( select *, sum(case when status=’Success’ and prev_status=’Failure’ then 1 else 0 end)over( order by date)
as grp_key from( select *,lag(status,1,status)over(order by date asc) as prev_status from rate)A)select status,min(date) as min_date, max(date) as max_date from cte group by grp_key,status;
13. Explain the types of join in Pyspark and SQL.
14. Explain the broadcast join, broadcast variable, and Semi and Anti join in Pyspark.
15. Optimization techniques in Pyspark.
16. Explain the architecture and working of spark.
17. Types of IR in azure and the difference between them.
18. Difference between tumbling window and schedule-based triggers.
19. Difference between Lookup and GetMetaData Activity.
20. CI/CD implementation for ADF pipeline.
21. Write an sql query to find non matching data from two tables.
Solution:1
create table emp1(id int); insert into emp1 values(1),(2),(3),(4);
create table emp2(id int); insert into emp2 values(1),(2),(5),(6);
(select emp1.id from emp1 union select emp2.id from emp2)
except
select emp1.id from emp1 inner join emp2 on emp1.id =emp2.id;
Solution2:
select emp1.id, emp2.id from emp1 full outer join emp2 on emp1.id = emp2.id where emp1.id is null or emp2.id is null;
22. Dense_rank() vs rank()
23. How to handle a scenario in ADF if a file which is already processed should not be processed again. (Hint: GetMetaDataActivity).
24. How to send email notification on an failure of an activity?(Hint: Azure LogicApp)
25. How to save logs of all the activities of ADF.(Hint: Enable Azure Log Analytics and read logs using KQL(Kusto Query Language)).
26. Write an SQL query to find the missing numbers from the set of given numbers.
Solution:
with recur as(
select min(id) as min_id ,max(id) as max_id from students
union all
select min_id +1, max_id from recur where min_id < max_id)
select min_id from recur where min_id not in(select id from students)
Note: This IE list will get keep on updating.
Thanks for the read🙏🙏.Do clap👏👏 if find it useful.🙂🙂
“Keep learning and keep sharing knowledge.”