The rise and fail of an IoT solution


Radu Vunvulea


RADU VUNVULEA (@RaduVunvulea, VUNVULEARADU.BLOGSPOT.COM)
MCTS, MCP, MVP, BANK, HOME AUTOMATION, ENTERPRISE, AUTOMOTIVE, PHARMA, LEAN AND AGILE, E-COMMERCE, WEB, iQuest, AZURE, JAVASCRIPT, MOBILE, DOTNET, WCF, WPF, ENTHUSIASTIC

20.000 AWS VMs used to simulate the load
250.000 RPS (normal load)
100.000 devices registered and active in 15 minutes
400.000 files of 5 MB uploaded in 30 minutes
102M commands sent to devices in 5 hours
101.7M commands processed by devices in 5 hours
9M “I’m alive” events every 5 minutes (80 GB/h)
Load Test Output

Understand the real problems that
we will need to solve when we write our
own IoT framework

• Input:
– Processing an event takes too long (>1s)
– Number of events per second: >13.000 events/s
• Worker Roles
– ~100 events in parallel per instance
– 13.000/100 = 130 instances (theoretically)
– Max. number of partitions on Event Hub is 32
Problem
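
The arithmetic on this slide can be checked with a short sketch. It assumes that, within one consumer group, at most one worker instance actively reads a given partition, so the partition count caps the useful number of instances. The function names are illustrative and not part of any SDK.

```python
import math


def instances_needed(events_per_second: int, parallel_per_instance: int) -> int:
    """Instances required when each event takes about one second to process."""
    return math.ceil(events_per_second / parallel_per_instance)


def usable_instances(needed: int, partitions: int) -> int:
    """Assumption: one active reader per partition, so extra instances stay idle."""
    return min(needed, partitions)


needed = instances_needed(13_000, 100)
usable = usable_instances(needed, 32)
# Events per second that cannot be handled with only `usable` instances.
shortfall = (needed - usable) * 100

print(f"needed={needed} usable={usable} shortfall={shortfall} events/s")
```

Under these assumptions, adding Worker Role instances alone cannot close the gap; either the processing time per event or the event volume has to go down.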

• Two types of events
– 95% were only Heartbeats
– 5% were other types of events
Investigation
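
A quick calculation shows how few events remain once heartbeats are set aside. This sketch assumes the 95/5 split applies to the 13.000 events/s figure from the previous slide. The `split_by_type` helper and the `"type"` field are hypothetical, since the slides do not say how events were tagged.

```python
def split_by_type(events):
    """Separate heartbeat events from everything else (hypothetical "type" field)."""
    heartbeats = [e for e in events if e["type"] == "heartbeat"]
    others = [e for e in events if e["type"] != "heartbeat"]
    return heartbeats, others


total_rate = 13_000                        # events/s, from the previous slide
heartbeat_rate = total_rate * 95 // 100    # 95% heartbeats
other_rate = total_rate - heartbeat_rate   # remaining 5%

# A sample of 20 events with the same 95/5 ratio.
sample = [{"type": "heartbeat"}] * 19 + [{"type": "alarm"}]
heartbeats, others = split_by_type(sample)

print(heartbeat_rate, other_rate, len(heartbeats), len(others))
```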

• Input:
– How to process 13.000 events/s
– We don’t need real-time processing
– All the input data is stored in Azure Storage
Problem

• Input:
– Processing an event takes too long (>100ms)
– CPU level is high
– There are times when the system freezes
Problem

• Input:
– Processing an event takes too long (>100ms)
– CPU level is high
– There are times when the system freezes
– ~1000 events/s on each Worker Role instance
• Even with batch processing
Problem

• The bottleneck is the logger
• Writing logs is very expensive
• Having high throughput and, at the same time,
a very detailed logging level activated is
impossible
Investigation
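
One common way to reduce the cost of logging on a hot path is to lower the log level and hand records to a background thread. The sketch below uses Python's standard `logging.handlers.QueueHandler` and `QueueListener`. It is an assumed mitigation for illustration, not the fix the author applied, and the `ListHandler` class and `process` function are hypothetical.

```python
import logging
import logging.handlers
import queue

records = []


class ListHandler(logging.Handler):
    """Hypothetical slow sink; here it only collects messages in memory."""

    def emit(self, record):
        records.append(record.getMessage())


# The hot path only enqueues; the listener thread does the expensive write.
log_queue = queue.Queue(-1)
listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()

logger = logging.getLogger("ingest")
logger.setLevel(logging.WARNING)   # debug records are dropped before any work is done
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False


def process(event_id: int) -> None:
    logger.debug("processing %s", event_id)            # filtered out by the level
    if event_id % 1000 == 0:
        logger.warning("checkpoint at event %s", event_id)


for i in range(1, 3001):
    process(i)

listener.stop()   # flushes the queue and joins the listener thread
print(records)
```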

Postulate:
– Once an event is consumed from Event Hub, it is
not removed
– We can reset the cursor as many times as we want
– We can analyze and process the same events over
and over again
Event Hub
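
The postulate can be illustrated with a small in-memory model: an append-only log plus a consumer that only moves a cursor. `PartitionLog` and `Consumer` are hypothetical names for this sketch and do not mirror the Event Hub SDK. A real hub also discards events once its retention period expires, which is not modelled here.

```python
class PartitionLog:
    """Append-only event log: reading never deletes anything."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset, max_count):
        return self._events[offset:offset + max_count]


class Consumer:
    """Keeps its own cursor (offset) into the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_count=10):
        batch = self.log.read_from(self.offset, max_count)
        self.offset += len(batch)
        return batch

    def reset(self, offset=0):
        self.offset = offset


log = PartitionLog()
for i in range(5):
    log.append(f"event-{i}")

consumer = Consumer(log)
first_pass = consumer.poll()
consumer.reset()               # rewind the cursor
second_pass = consumer.poll()  # the same events are delivered again

print(first_pass == second_pass)
```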

• Thread-Safe
• Multi-process
• Checkpoint
• Partition Lease management
Event Hub Processor
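
Partition lease management means that each partition is owned by exactly one worker at a time and that ownership is rebalanced when workers join or leave. The sketch below is a deliberately simplified round-robin assignment. A real processor negotiates leases through a shared store, and `assign_leases` and the worker names are hypothetical.

```python
def assign_leases(partitions, workers):
    """Round-robin assignment: each partition is leased to exactly one worker."""
    leases = {worker: [] for worker in workers}
    for index, partition in enumerate(partitions):
        owner = workers[index % len(workers)]
        leases[owner].append(partition)
    return leases


partitions = list(range(32))

three_workers = assign_leases(partitions, ["w1", "w2", "w3"])
four_workers = assign_leases(partitions, ["w1", "w2", "w3", "w4"])  # after a scale-out

print({w: len(p) for w, p in three_workers.items()})
print({w: len(p) for w, p in four_workers.items()})
```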

• Input:
– Under load, we started to see a lot of
throttling exceptions (“quota exceeded exception”)
Problem
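
A typical client-side reaction to throttling is to retry with exponential backoff. The sketch below is a generic illustration and not the author's fix. `QuotaExceededError` is a hypothetical stand-in for the service's throttling error, and `flaky_send` simulates a sender that is throttled twice.

```python
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the service's throttling error."""


def send_with_backoff(send, event, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry `send(event)` on throttling, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return send(event)
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise                      # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_send(event):
    """Simulated sender: throttled on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise QuotaExceededError()
    return "sent"


delays = []   # record the waits instead of actually sleeping
result = send_with_backoff(flaky_send, {"id": 1}, sleep=delays.append)
print(result, delays)
```

Backoff only smooths short bursts. If the sustained rate exceeds the purchased quota, capacity has to be increased.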

• Throughput Units (TU) are shared between
all the Event Hubs in the same
namespace
Event Hub and Namespaces
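
Because Throughput Units are bought per namespace, the ingress rates of all hubs in that namespace add up. The sketch below assumes a per-TU ingress limit of 1.000 events/s, which should be verified against the current service limits. The hub names and the rate of the second hub are hypothetical.

```python
import math

# Assumed per-TU ingress limit in events/s; verify against the current documentation.
TU_INGRESS_EVENTS_PER_SECOND = 1_000


def throughput_units_needed(hub_rates: dict) -> int:
    """TUs are shared per namespace, so the hubs' ingress rates are summed."""
    total = sum(hub_rates.values())
    return math.ceil(total / TU_INGRESS_EVENTS_PER_SECOND)


# Hypothetical namespace with two hubs; only the 13.000 events/s comes from the slides.
namespace = {"telemetry": 13_000, "commands": 2_500}

needed_together = throughput_units_needed(namespace)
needed_telemetry_only = throughput_units_needed({"telemetry": 13_000})
print(needed_together, needed_telemetry_only)
```

A hub that looks well within its own budget can still be throttled when a sibling hub in the same namespace consumes the shared units.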

• Input:
– The maximum size of an event is 256 KB
– Unit of measure is 64 KB
– How should we handle events with a payload bigger
than 256 KB or 64 KB?
Problem
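
One possible answer is to split a large payload into chunks that fit the event size limit and reassemble them on the consumer side. The sketch below uses the 256 KB and 64 KB figures from the slide. The chunk envelope (`index`, `total`, `body`) is a hypothetical format, and a real implementation would need to leave room for headers within each event.

```python
import math

MAX_EVENT_BYTES = 256 * 1024      # event size limit from the slide
BILLING_UNIT_BYTES = 64 * 1024    # unit of measure from the slide


def billing_units(size_bytes: int) -> int:
    """Number of 64 KB units a single event of this size is counted as."""
    return max(1, math.ceil(size_bytes / BILLING_UNIT_BYTES))


def split_payload(payload: bytes, chunk_size: int = MAX_EVENT_BYTES):
    """Split a payload into ordered chunks that each fit in one event."""
    total = math.ceil(len(payload) / chunk_size)
    return [
        {"index": i, "total": total, "body": payload[i * chunk_size:(i + 1) * chunk_size]}
        for i in range(total)
    ]


def join_chunks(chunks) -> bytes:
    """Reassemble the payload, tolerating out-of-order delivery."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    return b"".join(c["body"] for c in ordered)


payload = bytes(600 * 1024)       # a 600 KB payload
chunks = split_payload(payload)
restored = join_chunks(reversed(chunks))

print(len(chunks), billing_units(100 * 1024), restored == payload)
```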

• Input:
– Things can go wrong
– Azure Event Hub or an Azure datacenter (region) can
go down for a short period of time, or we can even
lose the connection (caused by us)
– How can we define and create a failover mechanism
to cover these use cases?
Problem
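
A minimal client-side failover can be sketched as an ordered list of endpoints that are tried in turn. This is an illustration under assumptions and not the mechanism from the talk. The `primary` and `secondary` senders are hypothetical callables standing in for clients connected to two regions.

```python
def send_with_failover(event, senders):
    """Try each (name, send) pair in order; return the name of the first that succeeds."""
    errors = []
    for name, send in senders:
        try:
            send(event)
            return name
        except ConnectionError as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all endpoints failed: {errors}")


def primary(event):
    """Simulated outage of the primary region."""
    raise ConnectionError("primary region unavailable")


received = []


def secondary(event):
    """Simulated healthy secondary region."""
    received.append(event)


used = send_with_failover({"id": 7}, [("primary", primary), ("secondary", secondary)])
print(used, received)
```

The consumer side then has to read from both regions and tolerate duplicates, which this sketch does not cover.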

• Input:
– Internal and external architecture reviews revealed that
Service Bus Topics/Queues are not recommended for this use
case
Problem

• Input:
– Redis Cache is extremely fast (>120.000 reads/s)
– … but when you also have a lot of writes,
it is not as fast as you expect
– Latency for read operations went up (>2s)
Problem

• Input:
– Even if we scale a WebApp to multiple instances, under a load of
more than 5.000 requests per second the system will not behave as you
expect
• You can even discover that your WebApp was suspended, and the only
thing that you can do is to make a new deployment and delete the existing
one
Problem

• Grouping resources together and defining the
quality attributes of that scaling unit
• When the limits are hit, another scaling unit is
added, without adding more resources to the
current one(s)
Scaling Unit
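
The scaling-unit idea can be expressed as a capacity calculation: each unit has fixed limits, and load beyond those limits is absorbed by adding whole units. The limits below (25.000 devices and 4.000 events/s per unit) are hypothetical values chosen for illustration, and `ScalingUnit` is not a real API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingUnit:
    """Quality attributes (limits) of one scaling unit; the values are hypothetical."""
    max_devices: int
    max_events_per_second: int


def units_required(unit: ScalingUnit, devices: int, events_per_second: int) -> int:
    """Whole units needed so that no single limit is exceeded."""
    by_devices = math.ceil(devices / unit.max_devices)
    by_events = math.ceil(events_per_second / unit.max_events_per_second)
    return max(by_devices, by_events, 1)


unit = ScalingUnit(max_devices=25_000, max_events_per_second=4_000)

needed = units_required(unit, devices=100_000, events_per_second=13_000)
needed_one_more_device = units_required(unit, devices=100_001, events_per_second=13_000)
print(needed, needed_one_more_device)
```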

Does it make sense to create your
own system or to use an existing
one?

1. THE RISE AND FAIL OF AN IOT SOLUTION @RADUVUNVULEA GROUP HEAD OF CLOUD, ENDAVA

2. How did it all start?

6. How did it all end?

9. Out of scope

10. What do we need?

11. Device Registration, Ingest, Egress, AAA

13. Ingest, Analyze, React, Store

14. Ingest

29. Partition Lease Management

52. Scaling Unit

53. Conclusion

54. NOT

55. Does it make sense to create your own system or to use an existing one?

57. Questions & Answers

58. THANK YOU FOR YOUR ATTENTION!