PHP Continuous Data Processing
  • Imagine viewing a customer's fleet of 30 vehicles on a map: 60 queries refreshing every 30 seconds

PHP Continuous Data Processing Presentation Transcript

  • PHP & Continuous Data Processing
    Michael Peacock, October, 2011
  • No. Not milk floats (anymore)
    All Electric, Commercial Vehicles.
    Photo courtesy of kenjonbro:
  • About Michael Peacock
    • Senior/Lead Web Developer
    • Web Systems Developer, Telemetry Team – Smith Electric Vehicles US Corp
    • Author: PHP 5 Social Networking, PHP 5 E-Commerce Development, Drupal Social Networking (6 & 7), Selling online with Drupal e-Commerce, Building Websites with TYPO3
    • PHPNE Volunteer
    • Occasional technical speaker: PHP North-East, PHPNW 2010, SuperMondays, PHPNW 2011 Unconference, ConFoo 2012
  • Smith Electric Vehicles & Telemetry
    World's largest manufacturer of commercial, all-electric vehicles
    Smith Link – on-board vehicle telematics system, capturing over 2,500 data points each second on the vehicle and broadcasting them over the mobile network
    ~400 telemetry-enabled vehicles on the road
    World's largest telemetry project outside of F1
  • System Architecture
  • System Architecture
  • Problem #1: We Can't Lose Any Data
    Data is required as part of a $32 million grant from the US Department of Energy
    • Thousands of pieces of information collected on a per-second basis from a range of remote collection devices
    • Unpredictable amounts of data at any one time
    • More vehicles rolling off the production line with telemetry enabled
    • What about system downtime, upgrades, roll-outs and connectivity problems?
  • Message Queuing
    Solution: we use a fast, reliable, scalable, secure, hosted message queue
    • If our systems are offline, data builds up in the external message queue
    • If we are processing at full capacity, the surplus builds up in the message queue
    • If the vehicle loses GPRS signal, or the message queue were to be inaccessible, vehicles have an internal buffer of up to 7 days
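The buffering behaviour described above falls out of the consume-then-acknowledge pattern: a message is only removed from the queue after it has been processed successfully, so any outage or slowdown simply grows the backlog. A minimal sketch of that pattern, using an in-memory queue as a stand-in (the class and method names are illustrative, not StormMQ's actual API):

```php
<?php
// In-memory stand-in for a message queue, to illustrate the
// consume-then-ack pattern. Names are illustrative only.
class InMemoryQueue
{
    private array $messages = [];

    public function publish(string $body): void
    {
        $this->messages[] = $body;
    }

    public function peek(): ?string
    {
        return $this->messages[0] ?? null;
    }

    public function ack(): void
    {
        array_shift($this->messages);
    }

    public function depth(): int
    {
        return count($this->messages);
    }
}

function drainQueue(InMemoryQueue $queue, callable $process): int
{
    $handled = 0;
    while (($body = $queue->peek()) !== null) {
        if (!$process($body)) {
            break; // processing failed: leave the message queued
        }
        $queue->ack(); // remove only after a successful handle
        $handled++;
    }
    return $handled;
}
```

If `$process` fails (systems offline, capacity exhausted), nothing is acknowledged and the unprocessed messages stay queued until the consumer comes back.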
  • Secret Weapon #1: StormMQ
    • Based on AMQP, an open standard
    • Secure: all data is encrypted and sent over SSL
    • Reliable: huge investment in server infrastructure
    • Hosted: backed up with an SLA
    • Scalable: capable of processing huge numbers of incoming messages, with capacity to store the messages when we perform maintenance on our systems
  • Problem #2: Processing Data Quickly
    We utilise a dedicated server and a number of dedicated applications to pull these messages and process them
    • This needs to happen quickly enough for live data to be seen through the web interface
    • Data is rapidly converted into batch SQL files, which are imported to MySQL via "LOAD DATA INFILE"
    • Results in a high number of inserts per second (20,000 – 80,000)
    • LOAD DATA INFILE isn't enough on its own...
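A minimal sketch of the batch-file approach: buffer readings into a tab-separated file, then hand the whole file to MySQL in one `LOAD DATA INFILE` statement instead of thousands of individual INSERTs. The table name and column layout here are assumptions for illustration:

```php
<?php
// Write one tab-separated line per reading (e.g. timestamp,
// data-point id, value) into a batch file for bulk import.
function writeBatchFile(string $path, array $rows): int
{
    $fh = fopen($path, 'w');
    foreach ($rows as $row) {
        fwrite($fh, implode("\t", $row) . "\n");
    }
    fclose($fh);
    return count($rows);
}

// Build the single statement that imports the whole batch at once.
function loadDataSql(string $path, string $table): string
{
    return sprintf(
        "LOAD DATA INFILE '%s' INTO TABLE %s " .
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'",
        addslashes($path),
        $table
    );
}
```

The win is that MySQL parses and inserts the file in one pass, which is how insert rates in the tens of thousands per second become feasible.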
  • Secret Weapon #2: DBA
    Sam Lambert – DBA Extraordinaire
    • Constantly tweaking the servers and configuration to get more and more performance
    • Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before
  • Sharding
    • Huge volumes of data being stored
    • We shard the data based on the truck it came from; each truck has its own database
    • Databases are held on one of many database servers in our cluster, each with ~100GB RAM
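A sketch of what per-truck sharding can look like in code: one database per truck, with trucks distributed across servers by modulus. The naming scheme and server list are assumptions, not the real configuration:

```php
<?php
// Map a truck id to its shard: a dedicated database, placed on
// one of the cluster's servers by modulus. Deterministic, so every
// part of the system agrees on where a truck's data lives.
function shardFor(int $truckId, array $servers): array
{
    return [
        'server'   => $servers[$truckId % count($servers)],
        'database' => sprintf('telemetry_truck_%04d', $truckId),
    ];
}
```

Because the mapping is a pure function of the truck id, no lookup table is needed, though adding servers later means re-balancing.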
  • Live, Real Time Information
    [live screen photo]
  • Real Time Status and Tracking
  • Live, Real Time Information: Problem
    Original database design dictated:
    • All data points were stored in the same table
    • Each type of data point required a separate query, sub-query or join to obtain
    Workings of the remote device collecting the data, and the processing server, dictated:
    • GPS co-ordinates can be up to 6 separate data points, including: longitude; latitude; altitude; speed; number of satellites used to get location; direction
  • Real Time Information: Concurrent
    Initial solution from the original developers:
    • Pull as many pieces of real-time information through asynchronously as possible
    • Involved the use of Flash-based "widgets" which called separate PHP scripts to query the data
    • Pages loaded relatively quickly
    • Data points took a little time to load
    • Not good enough
  • Real Time Information: Caching
    • High volumes of data, and varying levels of concurrent processing, mean query times are often not consistent
    • Memcache was used when processing the data from the message queue, keeping a copy of the most recent value of each data point for each truck
    • Live, real-time information is accessed directly from Memcache, bypassing the database
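The read path above is a cache-aside arrangement: the processing pipeline writes the latest value of each data point into the cache, and live-view reads never touch MySQL. A sketch with a plain array standing in for Memcache (the key scheme is an assumption):

```php
<?php
// Array-backed stand-in for Memcache, mimicking Memcache::get()'s
// behaviour of returning false on a cache miss.
class ArrayCache
{
    private array $data = [];

    public function set(string $key, $value): void
    {
        $this->data[$key] = $value;
    }

    public function get(string $key)
    {
        return $this->data[$key] ?? false;
    }
}

// Live reads go straight to the cache; the database is never queried.
function latestDataPoint(ArrayCache $cache, int $truckId, string $point)
{
    return $cache->get("truck:{$truckId}:latest:{$point}");
}
```

The same call shape works against the real Memcache extension, since the processing side keeps the `latest` keys fresh on every message it handles.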
  • Caching: Registry/DI is Ideal
    • Sporadic use of Memcache within the web application – an ideal use case for a lazy-loading registry or DI container
    • Give the registry or container details of the Memcache connection
    • The object is instantiated, and the connection made, only when data is requested from Memcache
  • Lazy Loading
    public function getObject( $key )
    {
        if( in_array( $key, array_keys( $this->objects ) ) )
        {
            return $this->objects[ $key ];
        }
        elseif( in_array( $key, array_keys( $this->objectSetup ) ) )
        {
            if( ! is_null( $this->objectSetup[ $key ]['abstract'] ) )
            {
                require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['abstract'] . '.abstract.php' );
            }
            require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['file'] . '.class.php' );
            $o = new $this->objectSetup[ $key ]['class']( $this );
            $this->storeObject( $o, $key );
            return $o;
        }
        elseif( $key == 'memcache' )
        {
            // requesting memcache for the first time: instantiate, connect, store and return
            $mc = new Memcache();
            $this->storeObject( $mc, 'memcache' );
            return $mc;
        }
    }
    This becomes the limit of the registry pattern; a DI container is more suitable
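The registry above needs a hard-coded branch for every lazily created service; a small closure-based DI container generalises it. This is a generic sketch, not the project's actual container: each service is registered as a factory closure and only built on first request.

```php
<?php
// Minimal lazy DI container: services are declared as factory
// closures and constructed only the first time they are requested.
class Container
{
    private array $factories = [];
    private array $instances = [];

    public function register(string $key, callable $factory): void
    {
        $this->factories[$key] = $factory;
    }

    public function get(string $key)
    {
        // Build on first access, then reuse the same instance.
        if (!isset($this->instances[$key])) {
            $this->instances[$key] = ($this->factories[$key])($this);
        }
        return $this->instances[$key];
    }
}
```

Registering Memcache as a factory closure means the connection is never opened on page loads that don't request live data, which is exactly the lazy behaviour the registry's special case was written for.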
  • Real Time Information: Extrapolate and Assume
    • Our telemetry unit broadcasts each data point once per second
    • Data doesn't change every second, e.g.
    • Battery state of charge may take several minutes to lose a percentage point
    • Fault flags only change to 1 when there is a fault
    • Make an assumption.
    • We compare the data to the last known value… if it's the same we don't insert; instead we assume it was the same
    • Unfortunately, this requires us to put additional checks and balances in place
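The change-detection step above can be sketched as a filter over incoming readings: a reading is only queued for insert when it differs from the last known value for that truck/data-point pair. State is held in a plain array here for illustration; in the pipeline described, the last known values would live somewhere like Memcache.

```php
<?php
// Keep only readings whose value differs from the last known value
// for the same truck and data point; unchanged readings are skipped
// and assumed to still hold.
function filterChangedReadings(array $readings, array &$lastKnown): array
{
    $toInsert = [];
    foreach ($readings as $r) {
        $key = $r['truck'] . ':' . $r['point'];
        if (!array_key_exists($key, $lastKnown) || $lastKnown[$key] !== $r['value']) {
            $toInsert[] = $r;               // value changed: insert it
            $lastKnown[$key] = $r['value'];
        }
        // otherwise: skip the insert and assume the old value still holds
    }
    return $toInsert;
}
```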
  • Extrapolate and Assume: "Interlation"
    Built a special library which:
    • Accepted a number of arrays, each representing a collection of data points for one variable on the truck
    • Used key indicators and time differences to work out if/when the truck was off, and extrapolation should stop
    • For each time data was recorded, pulled down data for the other variables for consistency
  • Interlace
    * Add an array to the interlation
    public function addArray( $name, $array )
    * Get the time that we first received data in one of our arrays
    public function getFirst( $field )
    * Get the time that we last received data in any of our arrays
    public function getLast( $field )
    * Generate the interlaced array
    public function generate( $keyField, $valueField )
    * Break the interlaced array down into separate days
    public function dayBreak( $interlationArray )
    * Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_
    public function generateAndFill( $keyField, $valueField )
    * Populate the new combined array with key fields using the common field
    public function populateKeysFromField( $field, $valueField = null )
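The "fill" step that `generateAndFill` implies can be sketched as a carry-forward over timestamps: for every second between the first and last reading, repeat the previous value so sparse series line up against each other. This is a hypothetical re-implementation inferred from the method list above, not the library's actual code:

```php
<?php
// Fill a sparse timestamp-keyed series so it has a value for every
// second between its first and last readings, carrying the previous
// value forward across the gaps.
function fillForward(array $series): array
{
    ksort($series);               // order readings by timestamp
    $times = array_keys($series);
    $first = $times[0];
    $last  = end($times);

    $filled  = [];
    $current = $series[$first];
    for ($t = $first; $t <= $last; $t++) {
        if (array_key_exists($t, $series)) {
            $current = $series[$t];   // a real reading: take it
        }
        $filled[$t] = $current;       // otherwise repeat the last value
    }
    return $filled;
}
```

Once every variable's series covers the same timestamp range, interlacing them into one combined array is a simple key-wise merge.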
  • Real Time Information: Single Request
    • Currently, each piece of "live data" is loaded into a Flash graph or widget, which updates every 30 seconds using an AJAX request
    • The move from MySQL to Memcache reduces database load, but the large number of requests still adds strain to the web server
    • Moving to image and JavaScript widgets, which are updated from a single AJAX request
  • Lots of Data: Race Conditions
    Sessions in PHP close at the end of the execution cycle
    • Unpredictable query times
    • Large number of concurrent requests per screen
    Session Locking
    Completely locks out a user's session, as PHP hasn't closed the session yet
  • Race Conditions: PHP & Sessions
    session_write_close() is added after each write to the $_SESSION array. It closes the current session.
    (requires a call to session_start() immediately before any further reads or writes)
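The pattern described on this slide can be wrapped in a tiny helper (the helper itself is illustrative, not from the original deck): write to the session, then release the lock immediately so concurrent AJAX requests for the same user are not serialised behind it.

```php
<?php
// Write a value to the session, then release the session lock right
// away so other requests for the same session can proceed. Any later
// session access must call session_start() again first.
function sessionWrite(string $key, $value): void
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start();             // (re)open the session for this write
    }
    $_SESSION[$key] = $value;
    session_write_close();           // flush and unlock immediately
}
```

The trade-off is that every subsequent read or write pays the cost of re-opening the session, which is why the deck pairs this with a template engine and a single entry point rather than sprinkling session access throughout the page.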
  • Race Conditions: Use a ******* Template Engine
    • V1 of the system mixed PHP and HTML
    • You can't re-initialise your session once output has been sent
    • All new code uses a template engine, so session interaction has no bearing on output. By the time the template is processed and output, all database and session work has been completed
  • Race Conditions: Use a Single Entry Point
    • Race conditions are further exacerbated by PHP timeout values
    • Certain exports, actions and processes take longer than 30 seconds, so the default execution time is longer
    • Initially the project lacked a single entry point, and execution flow was muddled
    • A single entry point makes it easier to enforce a lower timeout, which is overridden by intensive controllers or models
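The timeout policy above can be sketched in a few lines of front-controller code. The route names and limits here are invented for illustration; the idea is simply that the entry point sets a strict default, and only known long-running controllers raise it:

```php
<?php
// Front controller: enforce a low execution limit by default, and
// raise it only for routes known to run long (exports, reports).
set_time_limit(10);                        // strict default for normal pages

$longRunning = ['export_csv', 'compile_report'];
$route = 'export_csv';                     // normally derived from the request

if (in_array($route, $longRunning, true)) {
    set_time_limit(300);                   // intensive controllers override it
}
```

With a single entry point this policy lives in exactly one place, instead of being scattered across ad-hoc scripts.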
  • Intensive Queries & Calculations
    • How far did this vehicle travel?
    • Motor RPM x various vehicle-specific constants
    • Calculated for every RPM value held during the drive process
    • How much energy did the vehicle use?
    • Battery current x battery voltage x time
    • For every current and voltage value combination held during the driving process
    • How well was the vehicle driven?
    • Analysis of idle time
    • Harshness of accelerator and brake pedal usage
    • Inappropriate duration of AC / heater on time?
    • What about for a customer's fleet, or all of our vehicles sold?
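The energy calculation above is a discrete integral over the per-second samples: current × voltage × time-step summed across the drive gives joules (amps × volts × seconds), which divide down to kWh. A worked sketch, with invented sample values and an assumed sample shape:

```php
<?php
// Energy used over a drive: sum current × voltage × dt across
// consecutive samples, then convert joules to kWh.
function energyKwh(array $samples): float
{
    $joules = 0.0;
    $prevT  = null;
    foreach ($samples as $s) {
        if ($prevT !== null) {
            $dt      = $s['t'] - $prevT;            // seconds since last sample
            $joules += $s['amps'] * $s['volts'] * $dt;
        }
        $prevT = $s['t'];
    }
    return $joules / 3600000;                       // 3.6 MJ per kWh
}
```

Run over every current/voltage pair in a day's drive this is exactly the kind of per-record processing the next slide says is too slow to do on demand, hence the overnight compiled reports.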
  • Intensive Queries & Calculations
  • Intensive Queries & Calculations
    • Involves a fair number of queries per vehicle
    • Calculations involve holding this data in memory
    • Processing required for every single record for that piece of data during that day
    Takes a while!
    • Calculate information overnight
    • Save it as a compiled report
    • Lookups and comparisons only need to look at the compiled / saved reports in the database
  • Reports
    In addition to our calculated reports, we also need to export key bits of information to grant authorities
    • Initially our PHP-based export scripts held one database connection per database (~400 databases)
    • Re-wrote to maintain only one connection per server, and switch the database used
    • Toggles to instruct the export to only apply to 1 of the servers at a time
    • Modulus magic to run multiple export scripts per server
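The "modulus magic" can be sketched as a simple work-partitioning function: run N export workers and give worker k every database whose index is congruent to k mod N, so each database is exported by exactly one worker. Database names here are illustrative:

```php
<?php
// Partition a list of databases across export workers by modulus:
// worker k handles every database whose index ≡ k (mod workerCount),
// so the full list is covered with no overlap.
function databasesForWorker(array $databases, int $workerIndex, int $workerCount): array
{
    $mine = [];
    foreach (array_values($databases) as $i => $db) {
        if ($i % $workerCount === $workerIndex) {
            $mine[] = $db;
        }
    }
    return $mine;
}
```

Launching the same script with a different `$workerIndex` per process is all it takes to parallelise the export without any coordination between workers.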
  • Triggers and Events
    Currently a work-in-progress R&D project, evaluating two options:
    • Golden hammer: use PHP
    • Run PHP as a daemon
    • Continually monitor for specific changes to memcache variables
    • Node.js
    • Lightweight and fast
    • Give PHP another friend
    • Link into a PHP-based API to run triggers
  • The Future
    • More sharding
    • Based on time – keep the individual tables smaller
    • NoSQL?
    • Currently investigating NoSQL solutions as alternatives
    • Rationalisation
    • Do we need as much data as we collect?
    • Abstraction
    • We need to continually abstract concepts and ideas to make on-going maintenance and expansion easier, especially in terms of mapping code to database shards
    • More hardware
    • Expand our DB cluster, more RAM, R&D
    • Design
    • A much-needed design refresh
  • Conclusions
    • Make the solution scalable from the start
    • Where data collection is critical, use a message queue, ideally hosted or "cloud-based"
    • Hire a genius DBA to push your database engine
    • Make use of data caching systems to reduce strain on the database
    • Calculations and post-processing should be done during dead time and automated
    • Add more tools to your toolbox – PHP needs lots of friends in these situations
    • Watch out for session race conditions: where they can't be avoided, use session_write_close(), a template engine and a single entry point
    • Reduce the number of continuous AJAX calls
  • Q & A
    Michael Peacock
    Web Systems Developer – Telemetry Team – Smith Electric Vehicles US Corp
    Senior / Lead Developer, Author & Entrepreneur