Telematics information has been flowing from our assets to Caterpillar via email, satellite, cell tower, and direct connect for over 20 years. Our systems have morphed from a single Unix box to the Azure cloud, from Oracle to Azure Table Storage to SQL Server to HBase/Phoenix, and from 10 to 500 messages a second. This presentation will track where we came from and, more specifically, our current system of Azure event hubs, Storm topologies, Phoenix backend, and streaming with Spark.
In this session, learn about the Caterpillar journey, from where we started to where we are today, including lessons learned along the way. This presentation is aimed at those wanting to understand the interrelationship between change, technology, and platform decisions. In addition, we will show how modern tools can dramatically reduce the complexity and time associated with IoT solution deployment. The audience should leave the session feeling that anyone can do IoT. Current tools are easier, faster, and better than ever. MARK JUCHEMS, Digital Technical Specialist, Caterpillar and JUSTIN RICE.
2. Caterpillar: Confidential Green
Mark Juchems
• Born and raised in Illinois
• Studied Bible Theology – Greek at Moody Bible Institute
• 20 years of Java development at Caterpillar
– 3 Years Apache Storm development
• Fun Facts:
– I love my wife!
– I love racing cars.
– 3 kids!
3. Caterpillar: Confidential Green
Justin Rice
• Born and raised in Pennsylvania
• Studied Computer Science at Penn State
• 5 years of Java and .NET development at Caterpillar
– 3 Years Apache Storm development
– 2 Years Custom Predictive Analytics
• Fun Facts:
– Designed and developed an embedded application for synchronizing fireworks with music
– Like to golf and play volleyball, not any good at either!
– Love technology, especially smart home devices
6. Caterpillar: Confidential Green
Cat® Product Link™
• Redefines fleet management effectiveness. It transmits the
information via cell and satellite. Product Link™ connects customers
to our world-class dealer network. Product Link gets you accurate,
timely and useful information about the location, utilization and
condition of your equipment—the kind of information that can make
a huge difference in the efficiency and costs of your entire operation.
It pays to know the Cat® Product Link.
8. Caterpillar: Confidential Green
Gen 1: PL3xx
November 1995
• ORBCOMM
• Dulles, VA - Satellite Services Provider
• RACOTEK
• Minneapolis, MN - Middleware/Infrastructure Experts
• TORREY SCIENCE
• San Diego, CA - Hardware Vendor
• ADVANCED SYSTEM DESIGNS
• Morton, IL - Back-end Experts
9. Caterpillar: Confidential Green
Gen 1: PL3xx
Vision:
The high level requirement was to create a cost effective
data communications system that could transmit engine
status, warning and fault information over multiple
wireless networks to Caterpillar dealers and customers.
14. Caterpillar: Confidential Green
Gen 1: PL3xx
• 2003 - CWI 2.0 rolled out - still in use today. First message handler.
• 20 - 50 messages per second. Unix box running Java 1.4.
• Uses email to communicate.
16. Caterpillar: Confidential Green
Gen 1: PL3xx
I apologize for this hellacious mess.
If you have to work on it, please look me up and I will buy
you a coke.
…
He was right: On the other hand, he DID buy me a Coke.
20. Caterpillar: Confidential Green
Gen 2: PL6xx
• A new Message handler using “modern” technology.
• 25 New Message Types.
• My background was web apps using Spring, Hibernate, Tomcat, Jersey and REST.
• Was afraid of Threads, MQ, and Complexity.
• I fell into the best job ever.
• “Most great works have been made by one mind. The exceptions have been made by two minds. And two is indeed a magic number for collaborations; marriage was a brilliant invention and has a lot to be said for it.” Frederick Brooks, The Design of Design, p. 65.
• I had a boss that completely trusted me for good or for ill.
• I wrote the Message Handler and Matt Ledger wrote the APIs.
29. Caterpillar: Confidential Green
Gen 2: PL6xx
• The good:
• SQL database
• Spring
• No Dead Letter Queue.
• jQuery Mobile
• The struggles:
• Jersey
• Versioning
• Functional Testing Team
• Hibernate?
• MQ
31. Caterpillar: Confidential Green
Gen 3: Consolidation
• First push for consolidation of devices in a single platform
– Including joint venture devices PL4xx and PL5xx
• How do we scale from Gen 2 with the same technologies?
• Manual approach to distributed computing
• Pub Sub approach rather than database coupling
• Cloud move not yet confirmed
33. Caterpillar: Confidential Green
Gen 3: Consolidation
• Pros
– Data wasn’t coupled solely to SQL; other applications could leverage the topic for real-time capabilities
– Independent scaling of message processing and persistence layer
• Cons
– MQ Throughput became a bottleneck
– Server management more cumbersome
34. Caterpillar: Confidential Green
Gen 3: Consolidation
• Java 1.6
• Spring 3.0.6
• Hibernate
• Ehcache
• Jersey 1.17
• Jmock
• JQuery UI
• Coda Hale Metrics
• Avro
40. Caterpillar: Confidential Green
Topology Strengths
• We can add a new file type in 1 day.
• It is relatively cheap.
• Our code is debugged.
• Flexible – easy to add bolts and such.
• Great Logging
• Storm UI
• Runs locally
• Fast
41. Caterpillar: Confidential Green
Storm
Challenges
• Black Box.
• Documentation.
• Adding insignificant code can have an enormous effect on throughput.
– Cloner
– Caffeine cache
– Topologies affect each other.
– Only ack once!
• Examples!
• Event Hubs.
• Odd things help performance.
• Talent
• Functional test
43. Caterpillar: Confidential Green
Batching
• Single inserts to Phoenix/HBase are “slow”
• Leveraged Storm’s Tick Tuple and “Buckets” to micro-batch.
• Increased performance drastically
• Fully configurable
• Can be used for Databases or Event Hubs.
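The micro-batching idea above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the production bolt: `MicroBatcher`, `maxBatchSize`, and the `sink` callback are hypothetical names, and the Storm wiring is left out so the example stays self-contained. In a real bolt, `onTick()` would be called when a tick tuple arrives (Storm's tick frequency is set with `topology.tick.tuple.freq.secs`), and tuples would presumably be acked only after their batch has been written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Buffers rows in a "bucket" and hands them to a sink as one batch,
 * either when the bucket is full or when a periodic tick arrives.
 * All names here are illustrative, not taken from the original system.
 */
class MicroBatcher<T> {
    private final int maxBatchSize;        // configurable upper bound per batch
    private final Consumer<List<T>> sink;  // e.g. one batched database or Event Hub write
    private List<T> bucket = new ArrayList<>();

    MicroBatcher(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Called for every normal (non-tick) tuple. */
    void add(T row) {
        bucket.add(row);
        if (bucket.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Called when the tick arrives, so a slow trickle of rows still gets written. */
    void onTick() {
        flush();
    }

    private void flush() {
        if (bucket.isEmpty()) {
            return; // nothing buffered, skip the write entirely
        }
        List<T> batch = bucket;
        bucket = new ArrayList<>(); // start a fresh bucket before handing off
        sink.accept(batch);
    }
}
```

The size limit bounds memory and write size under heavy load, while the tick bounds latency under light load. Both would be configuration values in practice.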
45. Caterpillar: Confidential Green
Persistence
• Database:
– Azure Table Storage – Sorting Capabilities
– MongoDB / Azure DocumentDB – Speed
– HBase - Phoenix – Ingestion/Speed
– SQL – Speed/Tuning
– Hive/Spark – Production Testing
46. Caterpillar: Confidential Green
HBase Troubles
• Hot Spotting
– Phoenix index was created on Record Create Timestamp.
– Single Region Server.
– Rebuild the indexes of 12 tables with Salting enabled.
• Stability
– Restarts.
– Inconsistencies.
• Documentation
– .NET Support
• Performance
– Select performance.
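Why salting helps with the hot spot can be sketched as follows. Phoenix's `SALT_BUCKETS` option prepends a hash-derived byte to each row key, so keys that sort next to each other (such as consecutive timestamps) land in different key ranges. The code below is only an illustration of that idea: `SaltedKey` is a hypothetical class, and `Arrays.hashCode` is a stand-in, not the hash Phoenix actually uses.

```java
import java.util.Arrays;

/** Illustrative salting: prefix = hash(rowKey) mod buckets. Not Phoenix's real hash. */
class SaltedKey {
    /** Returns a new key with a one-byte bucket prefix in front of the original key. */
    static byte[] salt(byte[] rowKey, int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in 1..256");
        }
        // floorMod keeps the bucket non-negative even when the hash is negative
        int bucket = Math.floorMod(Arrays.hashCode(rowKey), buckets);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }
}
```

Without the prefix, an index keyed on a create timestamp sends every new write to the region holding the latest key range. With it, writes are spread across up to `buckets` ranges, at the cost of range scans having to visit every bucket.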
47. Caterpillar: Confidential Green
Persistence Fun Facts
• Storm data access objects only do inserts. SQL is created with custom annotations. No Hibernate or transactions.
• Streaming uses JSON.
– Added fields are versioned.
– We don’t remove fields.
– We add at will. Some groups “listen” to our Markdown files.
• HBase tables were created with additional unused columns so that Phoenix indexes did not need to be rebuilt every time there was a change.
• Initially used long column names such as “dieselExhaustFluidTankLevel”. This name is saved on every row of data in HBase. Changed them to short names (K0, K1, A, B, C, X6, X4, X2, X5, V4, D1, D2, D3, D4, B1, B2, B3, B4, V1, V2, V3, T1, T2).
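One way annotation-driven insert SQL with short column names might look is sketched below. Everything here is hypothetical: the `@Col` annotation, the `FuelRow` class, the `FUEL` table, and the mapping of particular fields to `K0`, `K1`, and `A` are invented for illustration, and the original annotations are not shown in the slides. The statement uses `UPSERT` because that is the insert verb in Phoenix.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical annotation mapping a readable Java field to a short column name. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Col {
    String value();
}

/** Hypothetical row: long names live in the code, short names live in the table. */
class FuelRow {
    @Col("K0") String assetId;
    @Col("K1") long recordTime;
    @Col("A") double dieselExhaustFluidTankLevel;
}

class UpsertBuilder {
    /** Builds a parameterized UPSERT from the @Col annotations on a class. */
    static String upsertFor(Class<?> type, String table) {
        List<String> cols = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            Col c = f.getAnnotation(Col.class);
            if (c != null) {
                cols.add(c.value());
            }
        }
        // Reflection does not guarantee field order, so sort for a stable statement
        Collections.sort(cols);
        String marks = String.join(", ", Collections.nCopies(cols.size(), "?"));
        return "UPSERT INTO " + table + " (" + String.join(", ", cols) + ") VALUES (" + marks + ")";
    }
}
```

Calling `UpsertBuilder.upsertFor(FuelRow.class, "FUEL")` produces `UPSERT INTO FUEL (A, K0, K1) VALUES (?, ?, ?)`. The readable name stays in the Java source while the stored column qualifier stays a few bytes long.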
49. Caterpillar: Confidential Green
Redis
– Heavily used for device to asset relationship, subscriber info, enrichment data.
– Excellent performance for real time processing.
– Ease of use
50. Caterpillar: Confidential Green
Queue Depths
• MQ Queue depths in Event Hubs?!
• How do we monitor our position in an event hub?
• Leverage EventHubSpout ack method to notify Redis of current Sequence Number.
• Compare the max sequence number of the event hub with Storm’s position in Redis.
• Running Cron Jobs from Spout Threads
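The queue-depth calculation above can be sketched as follows. This is a simplified illustration with hypothetical names (`EventHubLagTracker`, `onAck`, `depth`): an in-memory map stands in for Redis, and the last enqueued sequence number is passed in rather than fetched from Event Hubs, so no client library calls are assumed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the highest acked sequence number per partition and derives a queue depth. */
class EventHubLagTracker {
    // Stand-in for Redis: partition id -> highest sequence number acked so far
    private final Map<String, Long> acked = new ConcurrentHashMap<>();

    /** Would be called from the spout's ack path with the acked message's sequence number. */
    void onAck(String partitionId, long sequenceNumber) {
        // Acks can arrive out of order, so keep the maximum seen
        acked.merge(partitionId, sequenceNumber, Long::max);
    }

    /**
     * Depth = last sequence number enqueued in the partition minus our acked position.
     * Returns -1 when nothing has been acked yet for that partition (depth unknown).
     */
    long depth(String partitionId, long lastEnqueuedSequenceNumber) {
        Long position = acked.get(partitionId);
        if (position == null) {
            return -1;
        }
        return Math.max(0, lastEnqueuedSequenceNumber - position);
    }
}
```

A scheduled job could call `depth(...)` for each partition and alert when the value keeps growing, which is roughly what an MQ queue-depth monitor would have provided.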
52. Caterpillar: Confidential Green
The Crux of Fault Tolerance
• Never lose a message, but...
– Replaying a message that wasn’t meant to be replayed
– Difficult to determine if a message is replaying
– Queue depth implementation didn’t catch it
– Resulted in a 12 hour processing delay
• No Dead Letter queue
– We archive everything for 60 days.
– We do get bad messages.
– Don’t write for exceptions.
53. Caterpillar: Confidential Green
Auditing
• Storm of Storms (Hurricane?)
• Each topology that interacts with a message has the capability to send info about that message to an audit event hub
• The data is then saved, using Storm, in HBase and is queryable via a dashboard. 30 Day TTL, 2 TB of Audits
54. Caterpillar: Confidential Green
How the cloud has made our jobs/apps better
• Easily spin up new technology
• Connections to databases are trivial.
• Proof of concepts are easy.
• Powerful hardware (Cat already had this, but so does the cloud)
• Access to the command lines. (more fun!)
• Cheap
• New technologies
• More access to experts.
• Analytics galore! We know what is happening IN our system. We can analyze our data easily.
55. Caterpillar: Confidential Green
What’s this all for?
REST APIs
AssetIds
Asset Structures
Assets
Basic Daily – Diagnostic
Basic Daily – Fuel
Basic Daily – SMU
Cumulatives
Devices
ECMs
Engine Start Stops
Fault Codes
Fuels
Loads
Locations
Payloads
Time Series
Justin – 4 generations of message handling: 3 on premises, and we finally made it to the cloud with the 4th gen.
The Product vision: Cat Product Link circa 1995. This is still our product today. All our devices are PL devices.
- Justin The messages are sent from the asset to a satellite (ORBCOMM) and then parsed into email format. At first there were only 2 satellites, and they were only overhead for maybe 5 minutes at a time. Remote mines used this system. Sometimes it would take several passes for the large messages to get transmitted. The cell format has since been retired (I think it was 2G). We still sell these to remote mines using satellite communication. Talk about the types of messages we receive and the frequency of some over others.
- Mark 4 teams contributed to the first-gen devices. Two, I believe, are out of business.
- Mark This is still our vision today. VisionLink is standard on most of our tractors today.
Mark
Mark
Justin PL201 – Every once in a while we find one of these still pinging us.
Justin I was 8 years old! PL121 radio. PL300 gateway
PL201 sells 94 units in one year.
PL1011000 units sold total.
Picture: All this is a PL321
Justin – What does CWI stand for?
Mark Peek into the weeds of code. This code is 15 years old and is still running.
Mark
Mark
Mark Elaborate on new message types, e.g. emissions data (DEF) and tire pressure
Outsourced MMS Gateway, Michigan and Italy
Mark Why XML? Schema. Many groups were going to use these messages and we wanted a standardized format. JSON was too freeform.
Mark
Mark Designed to scale out.
Coding began in October of 2012.
Mark 300 messages per second
Justin After trying every database you can think of for the newest version of this, using only a full SQL database was refreshing. Maturity of SQL, rich documentation, abundance of frameworks/ORMs
Don’t mention the Oracle!
Mark Very simple design. I didn’t use a dead-letter queue. A message had one pass through our system, although if Oracle was down, we did reprocess those. The asset would send binary messages to the Cat Gateway, which turned them into XML (string parsing only) and dumped them on an MQ queue. Four servers read from the queue and saved the messages to Oracle.
Mark Architecture
Goal: Roll out of bed supportable.
Mark We had an extra button, so we created a side hustle. Always add something creative, always something blinking. Sparklines, throughput per hour.
Mark
Lots of unit tests. The strongest part of our code. Fast changes.
Also, one designer. Makes everything easy. (The Design of Design).
Functional testing team.
Defensive programming
Mark Be careful with what you design. Sometimes it gets out into the wild.
Matt and I wanted good docs so whenever someone asked a question we would answer it via HTML (above).
Mark SQL Database is very fast. Each message does 3 database calls to get registration data + whatever is in the message. Up to 6 total. Very fast.
Spring makes everything easy. Never any connection problems.
No dead-letter queue – if the databases were up, there was no reason to try again. If the databases are down, we reprocess until we get it saved.
Teams had a tough time with Jersey. Validation was a mess.
Versioning REST services was tough (we used Accept/Content-Type headers, with all the associated problems).
Functional testing was cumbersome – the team would not automate anything.
MQ team was great – tools were great! – MQ was just overwhelmed with messages.
Justin Bringing it all together: CWI, PL4xx/5xx, and PL6xx before the cloud merge. The vision was to enable a data visualization layer across all devices.
Justin Designed for data viz, not as a generic application. Avro as the standard format. We spiked Spark, but it was really new and the learning curve didn’t fit our timeline. On-premises Hadoop/Spark instance.
Justin Talk a bit about how this approach was similar to Storm; it was the best we could do with the technologies at hand.
Justin Talk a little about how we beat up on our MQ team for 13 ms write speeds, but on moving to Azure we found that queues and topics were much slower, so we used Event Hubs.
Justin Metric systems, some of it stolen from Gen 2. Required a lot of manual metric tracking; App Insights made this process much, much easier.
Mark/Justin Towards the end of Gen 3 development, the cloud was approved. On-premises technologies were limited, and we didn’t have the ability to experiment with new technologies. Almost moved to another team because the tech was limiting for our skill sets. The cloud announcement was extremely exciting, and team morale was great. DevOps and continuous integration.
Starting to move Gen 3 and Gen 2 to offshore devs. Rearranged teams to fit a DevOps model, with pizza-sized teams.
Mark Our management had one main goal: handle 2 million assets.
This seemed familiar because of the experience Justin and Matt had with Gen3. By accident.
Excited about NoSQL databases.
Didn’t believe on-prem systems could handle 2 million assets.
Jeff and Sam did a bake-off between Storm and Azure web jobs (C#). Settled on Storm for fault tolerance, speed, and the ability to scale.
Mark
Transitioned to Linux when we moved to Phoenix
Justin
Mark
Justin
Black Box
Lack of documentation and “real” answers. (try this!)
Adding insignificant code can have enormous effect on throughput.
Cloner
Caffeine cache
Topologies affect each other.
Only ack once!
No real world examples! (that is why we are here)
Azure documentation has been a challenge. Interfacing with EventHubs has been troublesome.
Odd things help performance (reduce threads?)
Difficult to find Storm developers
Difficult to functional test
Mark We essentially have a mapping machine. We map from bytes to XML to DTO to Database/Json object.
Justin Our first database in the cloud was Table Storage. We started inserting one at a time. Then we batched 10 at a time. Then we used SQL Server, then Phoenix/HBase.
I was the only one on the team against it, because of the complexities it added.
Mark Documentation is again a major priority. We code up the tables in Java and add documentation there. When our JUnit tests run, they create Markdown, which is published on the web.
We even add some SQL at the bottom for us to use.
Justin We have tried every conceivable database Azure has to offer. Other teams had used MongoDB and DocumentDB, but they weren’t fast enough. Phoenix was brand new, completely bleeding edge. It was difficult to find documentation, and .NET support wasn’t available at the time. Summarization moved from a web job to Spark/Hive, with the struggles we had.
Azure Table Storage – Couldn’t sort in any way.
MongoDB / Azure DocumentDB – Slow. Never got far.
HBase - Phoenix – Ingest is really fast! No DBAs, cryptic documentation, can’t delete data (now we TTL), select won’t meet our SLA.
Azure SQL Server – Impressed with its speed of ingestion. Took us 6 months to copy a database. Has been time-consuming to tune.
Hive – Used with Spark and gave up. No testing tools, no unit tests, very bad experience. Although others at this conference speak well of it, we could not duplicate the excitement.
Justin
Hot Spotting
Phoenix index was created on Record Create Timestamp (Terrible Design).
Outcome: a production outage due to extreme load on a single Region Server.
Had to rebuild the indexes of 12 tables with Salting enabled.
Stability
Long restarts would block writes.
Restarts would cause inconsistencies and would need constant maintenance.
Documentation
.NET Support limited
Performance
Select performance under heavy load has been poor
Mark We tried to use pooling on Phoenix and it only caused trouble. NoSQL databases are not SQL databases!
Long data names – k1,k2,a,b,c,d
Mark Mention Lettuce – Redis
Mark wanted to do the entire app in Redis but it can’t sort. Very positive experience.
Mark We use Redis mainly as an interface between systems. Our registration information is all in Redis. It is put there by one team and then consumed by many others.
Justin. This was something we could not find on the web. Jeff figured this out and it is very useful. Not sure how you can work without it.
Mark
Experienced with Spring but struggled to get it working correctly within Storm’s distributed nature.
Settled on Google “Guice”
Not as elegant
A lot of boiler plate code
Too much of a hassle to switch
Justin
Justin Great for support, gives us the ability to track messages anywhere in the system.