PHP Continuous Data Processing


Published in: Technology
  • Imagine viewing a customer’s fleet of 30 vehicles on a map: that is 60 queries refreshing every 30 seconds

    1. 1. PHP & Continuous Data Processing<br />Michael Peacock, October, 2011<br />
    2. 2. No. Not milk floats (anymore)<br />All Electric, Commercial Vehicles.<br />Photo courtesy of kenjonbro:<br />
    3. 3. About Michael Peacock<br /><ul><li>Senior/Lead Web Developer
    4. 4. Web Systems Developer
    5. 5. Telemetry Team – Smith Electric Vehicles US Corp
    6. 6. Author
    7. 7. PHP 5 Social Networking, PHP 5 E-Commerce Development, Drupal Social Networking (6 & 7), Selling online with Drupal e-Commerce, Building Websites with TYPO3
    8. 8. PHPNE Volunteer
    9. 9. Occasional technical speaker
    10. 10. PHP North-East, PHPNW 2010, SuperMondays, PHPNW 2011 Unconference, ConFoo 2012</li></li></ul><li>Smith Electric Vehicles & Telemetry <br />World’s largest manufacturer of commercial, all-electric vehicles<br />Smith Link – on-board vehicle telematics system, capturing over 2500 data points each second on the vehicle and broadcasting them over the mobile network<br />~400 telemetry-enabled vehicles on the road<br />World’s largest telemetry project outside of F1<br />
    11. 11. System Architecture<br />
    12. 12. System Architecture<br />
    13. 13. Problem #1: We Can’t Lose Any Data<br />Data is required as part of a $32 million grant from the US Department of Energy<br /><ul><li>Thousands of pieces of information collected every second from a range of remote collection devices
    14. 14. Unpredictable amounts of data at any one time
    15. 15. More vehicles rolling off the production line with telemetry enabled
    16. 16. What about system downtime, upgrades, roll-outs and connectivity problems?</li></li></ul><li>Message Queuing<br />Solution: We use a fast, reliable, scalable, secure, hosted message queue<br /><ul><li>If our systems are offline, data builds up in the external message queue
    17. 17. If we are processing at full capacity, the surplus builds up in the message queue
    18. 18. If the vehicle loses GPRS signal, or the message queue is inaccessible, vehicles have an internal buffer of up to 7 days</li></li></ul><li>Secret Weapon #1: StormMQ<br /><ul><li>Based on AMQP, an open standard
    19. 19. Secure: All data is encrypted and sent over SSL
    20. 20. Reliable: Huge investment in server infrastructure
    21. 21. Hosted: Backed up with an SLA
    22. 22. Scalable: Capable of processing huge numbers of incoming messages, with capacity to store the messages when we perform maintenance on our systems</li></li></ul><li>Problem #2: Processing data quickly<br />We utilise a dedicated server and a number of dedicated applications to pull these messages and process them<br /><ul><li>This needs to happen quickly enough for live data to be seen through the web interface
    23. 23. Data is rapidly converted into batch SQL files, which are imported to MySQL via “LOAD DATA INFILE”
    24. 24. Results in high number of inserts per second (20,000 – 80,000)
    25. 25. LOAD DATA INFILE isn’t enough on its own...</li></li></ul><li>Secret Weapon #2: DBA<br />Sam Lambert – DBA Extraordinaire<br /><ul><li>Constantly tweaking the servers and configuration to get more and more performance
    26. 26. Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before
    30. 30.</li></li></ul><li>Sharding<br /><ul><li>Huge volumes of data being stored
    31. 31. We shard the data based on the truck it came from; each truck has its own database
    32. 32. Databases held on one of many database servers in our cluster each with ~100GB RAM</li></li></ul><li>Live, Real Time Information<br />[live screen photo]<br />
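The sharding scheme above (one database per truck, spread over several servers) could be resolved in code roughly as follows. This is only a sketch: the slides do not say how trucks are assigned to servers or how the databases are named, so the `shardFor()` helper, the `truck_<id>` naming, the modulo placement and the host names are all assumptions made for illustration.

```php
<?php
// Hypothetical mapping from a truck id to its shard. The naming scheme
// (truck_<id>) and the modulo placement are assumptions, not the real system.
function shardFor(int $truckId, array $servers): array
{
    // Pick a server by simple modulo over the list of hosts.
    $server = $servers[$truckId % count($servers)];

    return ['host' => $server, 'database' => 'truck_' . $truckId];
}

// Placeholder host names, for illustration only.
$servers = ['db1.example.internal', 'db2.example.internal', 'db3.example.internal'];
$shard   = shardFor(317, $servers);
```

Keeping the lookup in one function means the rest of the code never needs to know which server holds a given truck.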
    33. 33. Real Time Status and Tracking<br />
    34. 34. Live, Real Time Information: Problem<br />Original database design dictated:<br /><ul><li>All data-points were stored in the same table
    35. 35. Each type of data point required a separate query, sub-query or join to obtain</li></ul>Workings of the remote device collecting the data, and the processing server, dictated:<br /><ul><li>GPS Co-ordinates can be up to 6 separate data points, including: Longitude; Latitude; Altitude; Speed; Number of Satellites used to get location; Direction</li></li></ul><li>Real Time Information: Concurrent<br />Initial Solution from the original developers:<br /><ul><li>Pull as many pieces of real time information through asynchronously
    36. 36. Involved the use of Flash based “widgets” which called separate PHP scripts to query the data
    37. 37. Pages loaded relatively quickly
    38. 38. Data points took a little time to load
    39. 39. Not good enough</li></li></ul><li>Real Time Information: Caching<br /><ul><li>High volumes of data, and varying levels of concurrent processing means query times are often not consistent
    40. 40. Memcache was used when processing the data from the message queue, keeping a copy of the most recent value of each data point for each truck
    41. 41. Live, Real-Time information accessed directly from memcache, bypassing the database</li></li></ul><li>Caching: Registry/DI is Ideal<br /><ul><li>Sporadic use of memcache within the web application – ideal use case for a lazy loading registry or DI container
    42. 42. Give the registry or container details of memcache
    43. 43. Object is instantiated and the connection made only when data is requested from memcache</li></li></ul><li>Lazy Loading<br />
        public function getObject( $key )
        {
            // already instantiated: return the stored object
            if( in_array( $key, array_keys( $this->objects ) ) )
            {
                return $this->objects[$key];
            }
            // known but not yet loaded: include the files, instantiate and store
            elseif( in_array( $key, array_keys( $this->objectSetup ) ) )
            {
                if( ! is_null( $this->objectSetup[ $key ]['abstract'] ) )
                {
                    require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['abstract'] . '.abstract.php' );
                }
                require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['file'] . '.class.php' );
                $o = new $this->objectSetup[ $key ]['class']( $this );
                $this->storeObject( $o, $key );
                return $o;
            }
            elseif( $key == 'memcache' )
            {
                // requesting memcache for the first time, instantiate, connect, store and return
                $mc = new Memcache();
                $mc->connect( MEMCACHE_SERVER, MEMCACHE_PORT );
                $this->storeObject( $mc, 'memcache' );
                return $mc;
            }
        }
    This is where the registry pattern reaches its limit; a DI container is more suitable<br />
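As a rough illustration of the DI container alternative mentioned above, the special-case `memcache` branch can be replaced by a factory registered under a key and run only on first use. `LazyContainer`, `register()` and `get()` are hypothetical names, not part of the framework shown on the slide, and an `ArrayObject` stands in for the memcache client so that the sketch runs without the extension.

```php
<?php
// Minimal lazy-loading container sketch (hypothetical, not the slide's framework).
final class LazyContainer
{
    /** @var array<string, callable> factories keyed by service name */
    private array $factories = [];

    /** @var array<string, object> services that have already been built */
    private array $instances = [];

    public function register(string $key, callable $factory): void
    {
        $this->factories[$key] = $factory;
    }

    public function get(string $key): object
    {
        if (!isset($this->instances[$key])) {
            if (!isset($this->factories[$key])) {
                throw new InvalidArgumentException("Unknown service: $key");
            }
            // First request: build the object now and keep it for later calls.
            $this->instances[$key] = ($this->factories[$key])($this);
        }

        return $this->instances[$key];
    }
}

// Stand-in for a memcache client. In the application the factory would do what
// the slide's branch does: new Memcache(), then connect() to the server.
$connections = 0;
$container   = new LazyContainer();
$container->register('memcache', function () use (&$connections) {
    $connections++;           // counts how often a "connection" is made
    return new ArrayObject(); // placeholder object, an assumption for the demo
});

$first  = $container->get('memcache'); // factory runs here, once
$second = $container->get('memcache'); // served from the stored instance
```

Pages that never touch memcache never call the factory, so no connection is opened for them.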
    44. 44. Real Time Information: Extrapolate and Assume<br /><ul><li>Our telemetry unit broadcasts each data point once per second
    45. 45. Data doesn’t change every second, e.g.
    46. 46. Battery state of charge may take several minutes to lose a percentage point
    47. 47. Fault flags only change to 1 when there is a fault
    48. 48. Make an assumption.
    49. 49. We compare the data to the last known value; if it’s the same we don’t insert it, and instead assume the value was unchanged
    50. 50. Unfortunately, this requires us to put additional checks and balances in place</li></li></ul><li>Extrapolate and Assume: “Interlation”<br />Built a special library which:<br /><ul><li>Accepted a number of arrays, each representing a collection of data points for one variable on the truck
    51. 51. Used key indicators and time differences to work out if/when the truck was off, and extrapolation should stop
    52. 52. For each time data was recorded, pull down data for other variables for consistency</li></li></ul><li>Interlace<br />
        * Add an array to the interlation
        public function addArray( $name, $array )
        * Get the time that we first received data in one of our arrays
        public function getFirst( $field )
        * Get the time that we last received data in any of our arrays
        public function getLast( $field )
        * Generate the interlaced array
        public function generate( $keyField, $valueField )
        * Break the interlaced array down into separate days
        public function dayBreak( $interlationArray )
        * Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_
        public function generateAndFill( $keyField, $valueField )
        * Populate the new combined array with key fields using the common field
        public function populateKeysFromField( $field, $valueField=null )
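The "extrapolate and assume" idea can be sketched in two small functions: one that drops readings whose value has not changed since the last stored one, and one that rebuilds a per-second series by carrying the last known value forward. This is not the library listed above. `dropUnchanged()` and `fillForward()` are hypothetical names, the sample values are made up, and the sketch deliberately ignores the harder part the slides mention (detecting when the truck is off so that extrapolation stops).

```php
<?php
// Write side: keep a reading only when its value differs from the last one kept.
// $readings maps unix timestamp => value, in ascending time order.
function dropUnchanged(array $readings): array
{
    $kept     = [];
    $last     = null;
    $haveLast = false;

    foreach ($readings as $time => $value) {
        if (!$haveLast || $value !== $last) {
            $kept[$time] = $value;
            $last        = $value;
            $haveLast    = true;
        }
    }

    return $kept;
}

// Read side: rebuild a per-second series by carrying the last known value forward.
function fillForward(array $kept, int $from, int $to): array
{
    $filled  = [];
    $current = null;

    for ($t = $from; $t <= $to; $t++) {
        if (array_key_exists($t, $kept)) {
            $current = $kept[$t];
        }
        $filled[$t] = $current; // stays null until the first stored reading
    }

    return $filled;
}

// Made-up state-of-charge readings, one per second.
$raw     = [100 => 87, 101 => 87, 102 => 87, 103 => 86, 104 => 86];
$stored  = dropUnchanged($raw);             // [100 => 87, 103 => 86]
$rebuilt = fillForward($stored, 100, 104);  // identical to $raw
```

Here five readings shrink to two rows on insert, and the full series is recovered when it is read back.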
    53. 53. Real Time Information: Single Request<br /><ul><li>Currently, each piece of “live data” is loaded into a Flash graph or widget, which updates every 30 seconds using an AJAX request
    54. 54. The move from MySQL to Memcache reduces database load, but the large number of requests still adds strain to the web server
    55. 55. Moving to image and JavaScript widgets, which are updated from a single AJAX request</li></li></ul><li>Lots of Data: Race Conditions<br />Sessions in PHP close at the end of the execution cycle<br /><ul><li>Unpredictable query times
    56. 56. Large number of concurrent requests per screen</li></ul>Session Locking<br />Completely locks out a user’s session, as PHP hasn’t closed the session<br />
    57. 57. Race Conditions: PHP & Sessions<br />session_write_close()<br />Added after each write to the $_SESSION array. Closes the current session.<br />(requires a call to session_start immediately before any further reads or writes)<br />
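A minimal sketch of the pattern on this slide, assuming the default file-based session handler: open the session, write, and close it immediately so that other requests from the same user are not left waiting on the session lock. The `writeSession()` helper and the `selected_truck` key are hypothetical.

```php
<?php
// Hypothetical helper: open the session, apply the changes, and release the
// lock straight away so parallel requests from the same user are not blocked.
function writeSession(array $changes): void
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start(); // must run before any output is sent
    }

    foreach ($changes as $key => $value) {
        $_SESSION[$key] = $value;
    }

    session_write_close(); // saves the data and unlocks the session
}

writeSession(['selected_truck' => 42]);
// ... slow queries can run here without holding the session lock ...
```

Any later read or write needs another `session_start()` first, which is why the slides go on to recommend finishing all session work before output begins.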
    58. 58. Race Conditions: Use a ******* Template Engine<br /><ul><li>V1 of the system mixed PHP and HTML 
    59. 59. You can’t re-initialise your session once output has been sent
    60. 60. All new code uses a template engine, so session interaction has no bearing on output. When the template is processed and output, all database and session work has been completed long before.</li></li></ul><li>Race Conditions: Use a Single Entry Point<br /><ul><li>Race conditions are further exacerbated by the PHP timeout values
    61. 61. Certain exports, actions and processes take longer than 30 seconds, so the default execution time is longer
    62. 62. Initially the project lacked a single entry point, and execution flow was muddled
    63. 63. Single Entry Point makes it easier to enforce a lower time out, which is overridden by intensive controllers or models</li></li></ul><li>Intensive queries & Calculations<br /><ul><li>How far did this vehicle travel?
    64. 64. Motor RPM x Various vehicle specific constants
    65. 65. Calculated for every RPM value held during drive process
    66. 66. How much energy did the vehicle use
    67. 67. Battery Current x Battery Voltage x Time
    68. 68. For every current and voltage value combination held during the driving process
    69. 69. How well was the vehicle driven
    70. 70. Analysis of idle time
    71. 71. Harshness of accelerator and brake pedal usage
    72. 72. Inappropriate duration of AC / Heater on time?
    73. 73. What about for a customer’s fleet, or all of our vehicles sold?</li></li></ul><li>Intensive Queries & Calculations<br />
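The energy calculation listed above (battery current x battery voltage x time) amounts to summing power over the intervals between samples. The sketch below assumes each sample holds a timestamp in seconds, a current in amps and a voltage in volts, and that power stays constant until the next sample. `energyKwh()` and the sample figures are illustrative only.

```php
<?php
// Energy in kWh from per-sample current (A) and voltage (V).
// $samples: list of [unix time in seconds, amps, volts], ascending by time.
function energyKwh(array $samples): float
{
    $joules = 0.0;

    for ($i = 1, $n = count($samples); $i < $n; $i++) {
        [$t0, $amps, $volts] = $samples[$i - 1];
        $t1 = $samples[$i][0];

        // Power (W) is treated as constant until the next sample arrives.
        $joules += $amps * $volts * ($t1 - $t0);
    }

    return $joules / 3600000; // 1 kWh = 3.6 million joules
}

// Made-up figures: 100 A for 30 minutes, then 50 A for 30 minutes, at 300 V.
$samples = [
    [0,    100.0, 300.0],
    [1800,  50.0, 300.0],
    [3600,   0.0, 300.0],
];
$kwh = energyKwh($samples); // 22.5
```

At one sample per second a day of driving means tens of thousands of iterations per vehicle, which is why the slides move this work to overnight compiled reports.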
    74. 74. Intensive queries & Calculations<br /><ul><li>Involves a fair number of queries per vehicle
    75. 75. Calculations involve holding this data in memory
    76. 76. Processing required for every single record for that piece of data during that day</li></ul>Takes a while!<br />Solution:<br /><ul><li>Calculate information overnight
    77. 77. Save it as a compiled report
    78. 78. Lookups and comparisons only need to look at the compiled / saved reports in the database</li></li></ul><li>Reports<br />In addition to our calculated reports, we also need to export key bits of information to grant authorities<br /><ul><li>Initially our PHP based export scripts held one database connection per database (~400 databases)
    79. 79. Re-wrote to maintain only one connection per server, and switch the database used
    80. 80. Toggles to instruct the export to only apply for 1 of the servers at a time
    81. 81. Modulus magic to run multiple export scripts per server</li></li></ul><li>Triggers and Events<br />Currently a work-in-progress R&D project, evaluating two options:<br /><ul><li>Golden hammer: Use PHP
    82. 82. Run PHP as a daemon
    84. 84. Continually monitor for specific changes to memcache variables
    85. 85. Node.js
    86. 86. Lightweight and fast
    87. 87. Give PHP another friend
    88. 88. Link into PHP based API to run triggers </li></li></ul><li>The Future<br /><ul><li>More sharding
    89. 89. Based on time – keep the individual tables smaller
    90. 90. NoSQL?
    91. 91. Currently investigating NoSQL solutions as alternatives
    92. 92. Rationalisation
    93. 93. Do we need as much data as we collect?
    94. 94. Abstraction
    95. 95. We need to continually abstract concepts and ideas to make on-going maintenance and expansion easier; especially in terms of mapping code to database shards
    96. 96. More hardware
    97. 97. Expand our DB cluster, more RAM, R&D
    98. 98. Design
    99. 99. A much needed design refresh</li></li></ul><li>Conclusions<br /><ul><li>Make the solution scalable from the start
    100. 100. Where data collection is critical, use a message queue, ideally hosted or “cloud based”
    101. 101. Hire a genius DBA to push your database engine
    102. 102. Make use of data caching systems to reduce strain on the database
    103. 103. Calculations and post-processing should be done during dead time and automated
    104. 104. Add more tools to your toolbox – PHP needs lots of friends in these situations
    105. 105. Watch out for Session race conditions: where they can’t be avoided, use session_write_close, a template engine and a single entry point
    106. 106. Reduce the number of continuous AJAX calls</li></li></ul><li>Q & A<br />Michael Peacock<br />Web Systems Developer – Telemetry Team – Smith Electric Vehicles US Corp<br /><br />Senior / Lead Developer, Author & Entrepreneur<br /> <br /><br />@michaelpeacock<br /><br /><br /> Extra information!<br />