Nidhi Gupta
7 min readJul 29, 2023

--

Azure Data Engineering Interview Questions

Data engineering plays a critical role in the world of data science and analytics.

In this article, I will be sharing questions that I have faced during the interview process i.e., asked in product and service-based companies.

  1. Write an SQL query to find a day on which the weather temperature is higher than the previous day.

create table weather
(id int,
recorddate date,
temperature int);

insert into weather values(1,’2015–01–01',10),(2,’2015–01–02',25),(3,’2015–01–03',20),(4,’2015–01–04',30);

select t.Id from Weather as t, Weather as y
where Datediff (t.RecordDate, y.RecordDate) =1
and t.Temperature > y.Temperature;

2. Write an SQL to get the sum of total purchases by category.
if you see the total purchases of fruits; which is the sum of banana and Apple i.e.10, similarly the total purchases of the four vegetables should be the sum of the vegetables i.e.44
So can you please write a query SQL query that will take this as input and you will have a new column that is total purchases?

create table item
(item varchar(20),
purchases int,
category varchar(20));

insert into item values(‘banana’,2,’fruit’),(‘apple’,8,’fruit’),(‘leek’,2,’vegetable’),(‘cabbage’,9,’vegetable’),(‘lettuce’,10,’vegetable’),(‘kale’,23,’vegetable’);

select item, purchases, category, sum(purchases)over(partition by category)as total_purchases from item ;

3. Can you write a query to display the cumulative total purchases like a banana, apple is there, so for a banana it is 2, and for Apple 8+2 so the total is 10. Similarly for vegetables: leek is 2, for cabbage it is 9 so it will be 9+2 i.e. 11.. & so on. For lettuce- 10+9+2 = 21 so on.

select item, purchases, category, sum(purchases)over(partition by category order by category,purchases)
as total_purchases from item ;

4. Write logic to print the print reverse of a number without using inbuilt functions.

n = int(input(“enter the number”)) #1234

reverse_num = 0

while n != 0:

reverse_num = (reverse_num*10) + (n %10)

n=n // 10

print(reverse_num)

5. Write logic to print even numbers from the list in Python.

lst = [1,2,3,4,5,6,7,8,9,10]

lst_even=[]

for i in lst:

if i%2 ==0:

lst_even.append(i)

print(lst_even)

6. Which salesman has sold the highest number of units each month?

create table SALES (SALES_DATE date ,SALESMAN_IDvarchar(20),PRODUCT_ID varchar(20),UNITS_SOLD int);

INSERT INTO SALES(SALES_DATE,SALESMAN_ID,PRODUCT_ID,UNITS_SOLD)VALUES
(‘2021–03–02’,’1',’2',75),
(‘2021–03–01’,’1',’2',50),
(‘2021–03–01’,’2',’1',40),
(‘2021–03–01’,’2',’1',60),
(‘2021–03–01’,’2',’1',60),
(‘2021–02–01’,’2',’1',50),
(‘2021–02–01’,’1',’2',50),
(‘2021–02–01’,’2',’1',50),
(‘2021–01–02’,’1',’2',35),
(‘2021–01–01’,’1',’1',35),
(‘2021–01–01’,’1',’2',48),
(‘2021–01–01’,’2',’1',34);

Solution:

select salesman_id, month,total_units_sold from (
select *, row_number()over(partition by month order by total_units_sold desc ) as row_num from (
select SALESMAN_ID, extract(month from SALES_DATE) as month, sum(UNITS_SOLD) as total_units_sold
from SALES
group by SALESMAN_ID,extract(month from SALES_DATE)
order by salesman_id,month ) as a) as b
where row_num=1;

7. Write a query to list out the student's ID who has scored greater than 50 marks in all the subjects.

create table stu(st_id int,sub_id varchar(20),marks int);

insert into stu values(1,’s1',60),(1,’s2',90),(1,’s3',70),(2,’s1',80),(2,’s2',40),(2,’s3',94),(3,’s1',73),(3,’s2',84),(3,’s3',52);

select * from stu;

Solution:

select * from (

SELECT st_id,

MAX(CASE WHEN sub_id = ‘s1’ THEN marks ELSE NULL END)AS s1,

MAX(CASE WHEN sub_id = ‘s2’ THEN marks ELSE NULL END) AS s2,

MAX(CASE WHEN sub_id = ‘s3’ THEN marks ELSE NULL END) AS s3

FROM

(select st_id,sub_id ,marks,ROW_NUMBER() OVER (PARTITION BY st_id order by st_id ) as row_id from stu ) DS

GROUP BY st_id ORDER BY st_id) as a where s1>50 and s2>50 and s3>50;

8. Left Join and Full outer join for TableA and TableB.

Table A:- a, a, a, a

Table B:- a, a, a, a

Solution:

create table TableA( aid varchar(20));
create table TableB( bid varchar(20));

insert into TableA(aid) values(‘a’),(‘a’),(‘a’),(‘a’);
insert into TableB(bid) values(‘a’),(‘a’),(‘a’),(‘a’);

Left Join
Full Outer Join

9. Write a logic in Pyspark to find the max(salary) of an employee based on department vise.

Solution:

emp =[(1,’IT’,2000),(2,’IT’,3000),(3,’CS’,5000),(4,’CS’,6000),(5,’EC’,7000)]

emp_schema =[“emp_id”,”dept_name”,”salary”]

dept = [(101,’IT’,),(102,’CS’),(103,’EC’)]

dept_schema =[“dept_id”,”dept_name”]

df_emp = spark.createDataFrame(data=emp, schema=emp_schema)

df_dept = spark.createDataFrame(data=dept, schema=dept_schema)

df_join = df_emp.join(df_dept,df_emp.dept_name == df_dept.dept_name, “inner”)

df_join.show()

df= df_join.orderBy(col(“dept_id”).asc()).groupBy(“dept_id”).agg(max(“salary”).alias(“max_salary_by_dept”))

df.show()

10.

Input df

df1=id,date,s1,s2,s3. df2 = id,date,s1,s2,s3,s4

Output df

df3 = id,date,s1,s2,s3,s4

Example:

d1 = [[1,’A1'],[2,’A2'],[3,’A3']]

s1 = [“id”,” name”]

d2 = [[4],[5],[6]]

s2 = [“id”]

df1 = spark.createDataFrame(data=d1,schema=s1)

df2 = spark.createDataFrame(data=d2,schema=s2)

df1.unionByName(df2,allowMissingColumns = True).show()

+---+----+
| id|name|
+---+----+
| 1| A1|
| 2| A2|
| 3| A3|
| 4|null|
| 5|null|
| 6|null|
+---+----+

11. You need to find pairs of adults and child to go for a ride. The pair should be in a way that the oldest adult should be with the youngest child and so on...Get the scripts from the description box of the video.

Solution :

INPUT

with cte_desc as(
select *,row_number()over(order by age desc) as rn_desc from Person1
where type =’Adult’
),
cte_asc as(
select *,row_number()over(order by age asc) as rn_asc from Person1
where type =’Child’
)
select a.person as ADULT,b.person as CHILD from cte_desc a
left join cte_asc b on a.rn_desc=b.rn_asc;

OUTPUT

12. Write an SQL query to take the I/P and print the O/P as mentioned below.

INPUT
OUTPUT

Solution:

with cte as ( select *, sum(case when status=’Success’ and prev_status=’Failure’ then 1 else 0 end)over( order by date) as grp_key from( select *,lag(status,1,status)over(order by date asc) as prev_status from rate)A)select status,min(date) as min_date, max(date) as max_date from cte group by grp_key,status;

13. Explain the types of join in Pyspark and SQL.

14. Explain the broadcast join, broadcast variable, and Semi and Anti join in Pyspark.

15. Optimization techniques in Pyspark.

16. Explain the architecture and working of spark.

17. Types of IR in azure and the difference between them.

18. Difference between tumbling window and schedule-based triggers.

19. Difference between Lookup and GetMetaData Activity.

20. CI/CD implementation for ADF pipeline.

21. Write an SQL query to find nonmatching data from two tables.

Solution:1

create table emp1(id int); insert into emp1 values(1),(2),(3),(4);

create table emp2(id int); insert into emp2 values(1),(2),(5),(6);

(select emp1.id from emp1 union select emp2.id from emp2)

except

select emp1.id from emp1 inner join emp2 on emp1.id =emp2.id;

Solution2:

select emp1.id, emp2.id from emp1 full outer join emp2 on emp1.id = emp2.id where emp1.id is null or emp2.id is null;

22. Dense_rank() vs rank()

23. How to handle a scenario in ADF if a file that is already processed should not be processed again. (Hint: GetMetaDataActivity).

24. How to send email notifications on a failure of an activity? (Hint: Azure LogicApp)

25. How to save logs of all the activities of ADF.(Hint: Enable Azure Log Analytics and read logs using KQL(Kusto Query Language)).

26. Write an SQL query to find the missing numbers from the set of given numbers.

Input
Output

Solution:

with recur as(

select min(id) as min_id ,max(id) as max_id from students

union all

select min_id +1, max_id from recur where min_id < max_id)

select min_id from recur where min_id not in(select id from students)

27. What is the difference between Access Control, Access Keys, and Shared Access Signature?

28. What is the purpose of having two keys?

29. What Types of activities are in the pipeline?

30. What is the difference between copy activity and lookup activity?

31. How to handle pipeline failures?

32. How to execute a pipeline?

33. What is the difference between a variable and a parameter?

34. How to execute a pipeline dynamically?

35. What do you understand by Mapping data Flows?

36. What is Integration Run Time? its types?

37. Components used in Azure Data Factory?

38. How do you pass parameters during pipeline runs?

39. How to handle NULL values?

40. What is the difference between Datasets and Linked Service?

41. Different ways to execute the Azure Data Factory pipeline?

42.

𝐏𝐫𝐨𝐛𝐥𝐞𝐦 𝐒𝐭𝐚𝐭𝐞𝐦𝐞𝐧𝐭:
Write a SQL Query 𝐭𝐨 𝐟𝐢𝐧𝐝 𝐭𝐫𝐚𝐝𝐞 𝐜𝐨𝐮𝐩𝐥𝐞𝐬 𝐬𝐚𝐭𝐢𝐬𝐟𝐲𝐢𝐧𝐠 𝐭𝐡𝐞 𝐟𝐨𝐥𝐥𝐨𝐰𝐢𝐧𝐠 𝐜𝐨𝐧𝐝𝐢𝐭𝐢𝐨𝐧𝐬 𝐟𝐨𝐫 𝐞𝐚𝐜𝐡 𝐬𝐭𝐨𝐜𝐤:

𝟏. 𝐓𝐡𝐞 𝐭𝐫𝐚𝐝𝐞𝐬 𝐬𝐡𝐨𝐮𝐥𝐝 𝐛𝐞 𝐰𝐢𝐭𝐡𝐢𝐧 𝟏𝟎 𝐬𝐞𝐜𝐨𝐧𝐝𝐬 𝐨𝐟 𝐰𝐢𝐧𝐝𝐨𝐰.
𝟐. 𝐓𝐡𝐞 𝐩𝐫𝐢𝐜𝐞𝐬 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐭𝐡𝐞 𝐭𝐫𝐚𝐝𝐞𝐬 𝐬𝐡𝐨𝐮𝐥𝐝 𝐛𝐞 𝐦𝐨𝐫𝐞 𝐭𝐡𝐚𝐧 𝟏𝟎%.

Solution:

Create Table trade_tbl(trade_id varchar(20),
trade_timestamp time,
trade_stock varchar(20),
quantity int,
price Float);

Insert into trade_tbl Values(‘TRADE1’,’10:01:05',’ITJunction4All’,100,20);
Insert into trade_tbl Values(‘TRADE2’,’10:01:06',’ITJunction4All’,20,15);
Insert into trade_tbl Values(‘TRADE3’,’10:01:08',’ITJunction4All’,150,30);
Insert into trade_tbl Values(‘TRADE4’,’10:01:09',’ITJunction4All’,300,32);
Insert into trade_tbl Values(‘TRADE5’,’10:10:00',’ITJunction4All’,-100,19);
Insert into trade_tbl Values(‘TRADE6’,’10:10:01',’ITJunction4All’,-300,19);

select t1.trade_id,t2.trade_id,t1.trade_timestamp,t1.price from trade_tbl t1 inner join trade_tbl t2 on t1.trade_id<>t2.trade_id AND t1.trade_timestamp < t2.trade_timestamp where DATEDIFF(second,t1.trade_timestamp,t2.trade_timestamp)<=10 and abs((t1.price-t2.price)*100/t1.price)>10 order by t1.trade_id

Note: This IE list will keep on updating.

Thanks for the read🙏🙏.Do clap👏👏 if find it useful.🙂🙂

“Keep learning and keep sharing knowledge.”

--

--

Nidhi Gupta

Azure Data Engineer 👨‍💻.Heading towards cloud technologies expertise✌️.