Evolution of a “big data” project — Michael Peacock

  • Hello everyone; thanks for coming. I spent the last 12 months working on a large-scale, data-intensive project, focusing on the development of a PHP web application which had to support, display, process, report against and export a phenomenal amount of data each day.
  • The project concerned dealing with vehicle telematics data from vehicles produced by Smith Electric Vehicles, one of the world's largest manufacturers of all-electric commercial vehicles. As a new and emerging industry, performance, efficiency and fault reporting data from these vehicles is very valuable. As you can imagine, with electric vehicles the drive and battery systems generate a large amount of data, with batteries broken down into smaller cells, each giving us temperature, current, voltage and state of charge data.
  • As the data may relate to performance and faults, we need to ensure we get the data; telematics projects which offer safety features have this as an even more important issue. We also have government partners who subsidise the vehicle cost in exchange for some of this data, so we need to be able to give this data to them as well as receiving it ourselves. As EVs rely on chemistry and external factors, we need to keep the data so we can compare it at different times of the year and in different locations.
  • What you will realise is that we in effect built a large-scale distributed-denial-of-service system and pointed it directly at our own hardware, with the caveat of needing the data from the DDoS attack!
  • Before we could do anything, we needed to be able to process the data and store it within the system. This includes actually transferring the data to our servers, inserting it into our database cluster and performing business logic on the data.
  • In order for us to reliably receive the data, we need the system to be online so that data can be transferred. We also need to have the server capacity to process the data, and we need to be able to scale the system. Just because there are X number of data collection units out there, we don't know how many will be on at a given time, and we have to deal with more and more collection units being built and delivered.
  • The biggest problem is dealing with the pressure of that data stream.
  • There are a range of AMQP libraries for PHP, some of them based on the C library and other awkward dependencies. A couple of guys developed a pure PHP implementation of the library which is really easy to use and install, and can be installed directly via Composer; as it's pure PHP it's really easy to get up and running on any platform. It provides support for both publishing and consuming messages from a queue, and is great not only for dealing with streams of data but also for storing events and requests across multiple sessions, or dispatching jobs.
  • A small buffer allows us to cope with connectivity problems to our message queue, or signal problems with the data collection devices.
  • To give data import the resources it needs, the system had dedicated hardware to consume messages from the message queue, perform business logic and convert them to MySQL inserts. Although it's an obvious one, it's also easily overlooked: the data is bundled together into LOAD DATA INFILE statements with MySQL.
  • With a project of this scale, dealing with business-critical data could lead to deployment anxiety: a bug in rolled-out code could cause problems with displaying real-time data, or cause exported data or processed reports to be incorrect, requiring them to be re-run at a cost of CPU time, most of which was already in use generating that day's reports or dealing with that day's data imports. The architecture of the application also imposed constraints on maintenance.

Evolution of a “big data” project — Michael Peacock

@michaelpeacock
Head Developer @groundsix
Author of a number of web related books
Occasional conference / user group speaker

Ground Six
Tech company based in the North East of England
Specialise in developing web and mobile applications
Provide investment (financial and tech) to interesting app ideas
Got an idea? Looking for investment? www.groundsix.com

What's in store
Challenges, solutions and approaches when dealing with billions of inserts per day:
Processing and storing the data
Querying the data quickly
Reporting against the data
Keeping the application responsive
Keeping the application running
Legacy project, problems and code

Vehicle Telematics

Electric Vehicles: Need for Data
We need to receive all of the data
We need to keep all of the data
We need to be able to display data in real time
We need to transfer large chunks of data to customers and government departments
We need to be able to calculate performance metrics from the data

Some stats
Approximately 500 telemetry-enabled vehicles using the system
2500 data points captured per vehicle per second
> 1.5 billion MySQL inserts per day
World's largest vehicle telematics project outside of Formula 1

More stats
Constant minimum of 4000 inserts per second within the application
Peaks: 3 million inserts per second

Processing and storing the data

Receiving continuous data streams
We need to be online
We need to have capacity to process the data
We need to scale

Message Queue
Fast, secure, reliable and scalable
Hosted: they worry about the server infrastructure and availability
We only have to process what we can

AMQP + PHP
php-amqplib (github.com/videlalvaro/php-amqplib), or install it via Composer: videlalvaro/php-amqplib
Pure PHP implementation
Handles publishing and consuming messages from a queue

AMQP: Consume
// connect to the AMQP server
$connection = new AMQPConnection($host, $port, $user, $password);
// create a channel; a logical stateful link to our physical connection
$channel = $connection->channel();
// link the channel to an exchange (where messages are sent)
$channel->exchange_declare($exchange, 'direct');
// bind the channel to the queue
$channel->queue_bind($queue, $exchange);
// consume by sending the message to our processing callback function
// (flags: no_local, no_ack, exclusive, nowait)
$channel->basic_consume($queue, $consumerTag, false, false, false, false, $callbackFunctionName);
while(count($channel->callbacks)) {
    $channel->wait();
}
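
The slide above only shows the consuming side; a minimal publishing sketch with the same library might look like the following (the exchange name and message body are assumptions, not taken from the deck):

// Sketch: publish a telemetry reading to the exchange the consumers are bound to
use PhpAmqpLib\Connection\AMQPConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPConnection($host, $port, $user, $password);
$channel = $connection->channel();
$channel->exchange_declare($exchange, 'direct');

// delivery_mode 2 marks the message as persistent so it survives a broker restart
$message = new AMQPMessage(json_encode($reading), array('delivery_mode' => 2));
$channel->basic_publish($message, $exchange);

$channel->close();
$connection->close();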

Buffers

Pulling in the data
Dedicated application and hardware to consume from the Message Queue and convert to MySQL inserts
MySQL: LOAD DATA INFILE
Very fast
Due to high volumes of data, these “bulk operations” only cover a few seconds of time - still giving a live stream of data
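
As a rough illustration of that approach (not the project's actual importer; the buffer, table and column names are assumptions), consumed messages can be written to a temporary CSV file and loaded in a single statement:

// Sketch: write a few seconds of readings to a CSV and bulk-load them
// (requires LOCAL INFILE to be enabled on both client and server)
$rows = $buffer->drain(); // hypothetical: readings already consumed from the queue
$file = tempnam(sys_get_temp_dir(), 'telemetry');
$fh = fopen($file, 'w');
foreach ($rows as $row) {
    fputcsv($fh, array($row['vehicle_id'], $row['recorded_at'], $row['signal_name'], $row['value']));
}
fclose($fh);

$mysqli = new mysqli($dbHost, $dbUser, $dbPass, $dbName);
$mysqli->query(
    "LOAD DATA LOCAL INFILE '" . $mysqli->real_escape_string($file) . "'
     INTO TABLE datavalue
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     (vehicle_id, recorded_at, signal_name, value)"
);
unlink($file);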

Optimising MySQL
innodb_flush_method=O_DIRECT
  Lets the buffer pool bypass the OS cache
  InnoDB buffer pools are more efficient than the OS cache
  Can have negative side effects
Improve write performance: innodb_flush_log_at_trx_commit=2
  Prevents per-commit log flushing
Query cache size (query_cache_size)
  Measure your application's usage and make a judgement
  Our data stream was too frequent to make use of the cache
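
Collected into a my.cnf fragment, the settings named above might look like this (only the three directives come from the slide; the query cache value shown is an assumption based on the note that the cache was not useful here):

[mysqld]
innodb_flush_method            = O_DIRECT   # buffer pool bypasses the OS cache
innodb_flush_log_at_trx_commit = 2          # flush the log roughly once per second instead of per commit
query_cache_size               = 0          # illustrative: cache disabled, the data stream defeated it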

Sharding (1)
Evaluate data, look for natural break points
Split the data so each data collection unit (vehicle) had a separate database
Gives some support for horizontal scaling, provided the data per vehicle is a reasonable size
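
A per-vehicle database split like this needs a small routing step in front of every query. A minimal sketch (the database naming scheme is an assumption; switching databases on one server connection echoes the efficiency note later in the deck):

// Sketch: select the per-vehicle database on an existing server connection
function useVehicleDatabase(mysqli $connection, $vehicleId)
{
    // one database per data collection unit, e.g. vehicle_42 (naming scheme assumed)
    $database = 'vehicle_' . (int) $vehicleId;
    if (!$connection->select_db($database)) {
        throw new RuntimeException('Unknown vehicle database: ' . $database);
    }
    return $connection;
}

$db = useVehicleDatabase(new mysqli($dbHost, $dbUser, $dbPass), 93);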

System architecture

But the MQ can store data... why do you have a problem?
Message Queue isn't designed for storage
Messages are transferred in a compressed form
Nature of vehicle data (CAN) means that a 16 character string is actually 4 - 64 pieces of data

Sam Lambert
Solves big-data MySQL problems for breakfast
Constantly tweaking the servers and configuration to get more and more performance
Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before
www.samlambert.com
http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html
http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html
Twitter: @isamlambert

Querying the data QUICKLY!

Graphs! Slow!

Long Running Queries
More and more vehicles came into service
Huge amount of data resulted in very slow queries:
Page load
Session locking
Slow exports
Slow backups

Real time information
Original database schema dictated all information was accessed via a query, or a separate subquery. Expensive.
Live information: up to 30 data points, refreshing every 5 - 30 seconds via AJAX
Painful

Requests
Asynchronous requests let the page load before the data
Number of these requests had to be monitored
Real time information used Fusion Charts: 1 AJAX call per chart, 10 - 30 charts per vehicle live screen, refreshed every 5 - 30 seconds

Requests: Optimised

Single entry point
Multiple entry points make it difficult to dynamically change the timeout and memory usage of key pages, as well as to deal with session locking issues effectively.
Single point of entry is essential
Check out the Symfony Routing component...

Symfony Routing
// load your routes
$locator = new FileLocator( array( __DIR__ . '/../../' ) );
$loader = new YamlFileLoader( $locator );
$loader->load( 'routes.yml' );
$request = ( isset( $_SERVER['REQUEST_URI'] ) ) ? $_SERVER['REQUEST_URI'] : '';
$requestContext = new RequestContext( $request );
// set up the router (RoutingRouter: presumably an alias for Symfony\Component\Routing\Router)
$router = new RoutingRouter( new YamlFileLoader( $locator ), 'routes.yml', array( 'cache_dir' => null ), $requestContext );
$requestURL = ( isset( $_SERVER['REQUEST_URI'] ) ) ? $_SERVER['REQUEST_URI'] : '';
$requestURL = ( strlen( $requestURL ) > 1 ) ? rtrim( $requestURL, '/' ) : $requestURL;
// get the route for your request
$route = $router->match( $requestURL );
// act on the route

Sharding: split the data into smaller buckets

Sharding (2)
Data is very time relevant
Only care about specific days
Don't care about comparing data too much
Split the data so that each week had a separate table

Supporting Sharding
Simple PHP function to run all queries through. Works out the table name. Link with a sprintf to get the full query string.
/**
 * Get the sharded table to use from a specific date
 * @param String $date YYYY-MM-DD
 * @return String
 */
public function getTableNameFromDate( $date )
{
    // ASSUMPTION: today's database is ALWAYS THERE
    // ASSUMPTION: you shouldn't be querying for data in the future
    $date = ( $date > date( 'Y-m-d' ) ) ? date( 'Y-m-d' ) : $date;
    $stt = strtotime( $date );
    if( $date >= $this->switchOver ) {
        $year = ( date( 'm', $stt ) == 01 && date( 'W', $stt ) == 52 ) ? date( 'Y', $stt ) - 1 : date( 'Y', $stt );
        return 'datavalue_' . $year . '_' . date( 'W', $stt );
    } else {
        return 'datavalue';
    }
}
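
Used as the slide suggests, the computed table name can be dropped into a query template with sprintf (the query itself and its column names are assumptions for illustration):

// Sketch: route a day's query to the correct weekly shard
$table = $this->getTableNameFromDate( $day );
$sql = sprintf(
    "SELECT recorded_at, value
     FROM %s
     WHERE vehicle_id = %d
       AND recorded_at BETWEEN '%s 00:00:00' AND '%s 23:59:59'",
    $table,
    (int) $vehicleId,
    $day,
    $day
);
$result = $this->database->query( $sql );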

Sharding: an excuse
Alterations to the database schema
Code to support smaller buckets of data
Take advantage of needing to touch queries and code: improve them!

Index Optimisation
Two sharding projects left the schema as a Frankenstein
Indexes still had data from before the first shard (the vehicle ID):
Wasting storage space
Increasing the index size
Increasing query time
Making the index harder to fit into memory

Schema Optimisation
MySQL provides a range of data types
Varying storage implications
Does that need to be a BIGINT?
Do you really need DOUBLE PRECISION when a FLOAT will do?
Are those tables, fields or databases still required?
Perform regular schema audits

Query Optimisation
Run your queries through EXPLAIN EXTENDED
Check they hit the indexes
For big queries avoid functions such as CURDATE - this helps ensure the cache is hit
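
To make the CURDATE point concrete, here is a hedged sketch (assumes an existing mysqli connection; table and column names are illustrative): computing the date in PHP keeps the query text identical between requests, so the query cache can actually be reused, and EXPLAIN EXTENDED shows whether an index is hit.

// Sketch: cache-friendly query text, then verify the plan
$today = date( 'Y-m-d' ); // computed once in PHP rather than via CURDATE()
$sql = "SELECT vehicle_id, MAX(value)
        FROM datavalue_2013_20
        WHERE recorded_at >= '" . $today . " 00:00:00'
        GROUP BY vehicle_id";

$plan = $mysqli->query( 'EXPLAIN EXTENDED ' . $sql );
while( $row = $plan->fetch_assoc() ) {
    echo $row['table'] . ' uses key: ' . $row['key'] . PHP_EOL;
}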

Reporting against the data

Performance report

Reports & Intensive Queries
How far did the vehicle travel today?
Calculation involves looking at every single motor speed value for the day
How much energy did the vehicle use today?
Calculation involves looking at multiple variables for every second of the day
Lookup time + calculation time

Group the queries
Leverage indexes
Perform related queries in succession
Then perform calculations
Catching up on a backlog of calculations and exports?
Do a table of queries at a time
Make use of indexes

Save the report
Automate the queries in dead time, grouped together nicely
Save the results in a reports table
Only a single record per vehicle per day of performance data
Means users and management can run aggregate and comparison queries themselves quickly and easily
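
A minimal sketch of that "one row per vehicle per day" idea; the reports table and its columns are assumptions, not the project's actual schema, and an existing mysqli connection is assumed:

// Sketch: persist the day's computed metrics as a single report row
$stmt = $mysqli->prepare(
    "INSERT INTO daily_reports (vehicle_id, report_date, distance_km, energy_kwh)
     VALUES (?, ?, ?, ?)
     ON DUPLICATE KEY UPDATE distance_km = VALUES(distance_km), energy_kwh = VALUES(energy_kwh)"
);
$stmt->bind_param( 'isdd', $vehicleId, $reportDate, $distanceKm, $energyKwh );
$stmt->execute();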

Enables date-range aggregation

Check for efficiency savings
Initial export scripts maintained a MySQLi connection per database (500!)
Updated to maintain one per server and simply switch to the database in question

Leverage your RAM
Intensive queries might only use X% of your RAM
Safe to run more than one report / export at a time
Add support for multiple exports / reports within your scripts, e.g.

$numberOfConcurrentReportsToRun = 2;
$reportInstance = 0; // which of the concurrent script instances this process is
$counter = 0;
foreach( $data as $unit ) {
    // each instance takes every Nth unit, so two scripts can run side by side
    if( ( $counter % $numberOfConcurrentReportsToRun ) == $reportInstance ) {
        $dataToProcess[] = $unit;
    }
    $counter++;
}

Extrapolate & Assume
Data is only stored when it changes
Known assumptions are used to extrapolate values for all seconds of the day
Saves MySQL but costs in RAM
“Interlation”

Interlation
 * Add an array to the interlation
   public function addArray( $name, $array )
 * Get the time that we first receive data in one of our arrays
   public function getFirst( $field )
 * Get the time that we last received data in any of our arrays
   public function getLast( $field )
 * Generate the interlaced array
   public function generate( $keyField, $valueField )
 * Break the interlaced array down into separate days
   public function dayBreak( $interlationArray )
 * Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_
   public function generateAndFill( $keyField, $valueField )
 * Populate the new combined array with key fields using the common field
   public function populateKeysFromField( $field, $valueField=null )
http://www.michaelpeacock.co.uk/interlation-library
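
As a rough illustration of the extrapolation idea (this is not the Interlation library itself, just a simplified stand-in), change-only samples can be forward-filled so that every second in the range has a value:

// Sketch: forward-fill change-only samples to one value per second
function fillPerSecond( array $samples ) // unix timestamp => value, stored only when it changed
{
    ksort( $samples );
    $timestamps = array_keys( $samples );
    $first = reset( $timestamps );
    $last = end( $timestamps );
    $filled = array();
    $current = null;
    for( $t = $first; $t <= $last; $t++ ) {
        if( isset( $samples[$t] ) ) {
            $current = $samples[$t]; // the value changed at this second
        }
        $filled[$t] = $current; // otherwise assume it held its previous value
    }
    return $filled;
}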

Food for thought
Gearman: tool to schedule and run background jobs

Keeping the application responsive

Session Locking
Some queries were still (understandably, and acceptably) slow
Sessions would lock and AJAX scripts would enter race conditions
User would attempt to navigate to another page: their session with the web server wouldn't respond

Session Locking: Resolution
Session locking caused by how PHP handles sessions: the session file is only closed once the request has finished executing
Potential solution: use another method, e.g. database-backed sessions
Our solution: manually close the session

Closing the session
session_write_close();
Caveats:
If you need to write to sessions again in the execution cycle, you must call session_start() again
Made problematic by the lack of template handling
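
A minimal sketch of the pattern in a slow AJAX endpoint (the session keys and the query function are assumptions):

// Sketch: release the session lock before slow work so parallel AJAX calls are not blocked
session_start();
$vehicleId = $_SESSION['selected_vehicle']; // read what we need while the session is open
session_write_close(); // other requests can now acquire the session lock

$data = runSlowTelemetryQuery( $vehicleId ); // hypothetical long-running query

// if we need to write to the session again, re-open it first
session_start();
$_SESSION['last_viewed_vehicle'] = $vehicleId;
session_write_close();

header( 'Content-Type: application/json' );
echo json_encode( $data );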

Live real-time data
Request consolidation helped
Each data point on the live screen was still a separate query due to original design constraints
Live fleet information spanned multiple databases, e.g. a map of all vehicles belonging to a customer
Solution: caching

Caching with memcached
Fast, in-memory key-value store
Used to keep a copy of the most recent data from each vehicle
$mc = new Memcache();
$mc->connect( $memcacheServer, $memcachePort );
$realTimeData = $mc->get( $vehicleID . '-' . $dataVariable );
Failover: Moxi Memcached Proxy
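
The import side presumably refreshes those keys as new readings arrive; a hedged sketch of the write path (the key scheme mirrors the slide, while the TTL and batch variable are assumptions):

// Sketch: keep the latest reading per vehicle/variable in memcache
$mc = new Memcache();
$mc->connect( $memcacheServer, $memcachePort );

foreach( $readings as $reading ) { // hypothetical batch just consumed from the queue
    $key = $reading['vehicle_id'] . '-' . $reading['variable'];
    $mc->set( $key, $reading['value'], 0, 60 ); // flags = 0, expire after 60 seconds
}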

Caching enables a large range of data to be looked up quickly

Legacy Project
Constraints, problems and code. Easing deployment anxiety.

Source Control Management
Initially SVN
Migrated to git
Branch-per-feature strategy
Automated deployment

Dependencies
Dependency Injection framework missing from the application, caused problems with:
Authentication
Memcache
Handling multiple concurrent database connections
Access control

Autoloading
PSR-0

Templates and sessions
Closing and opening sessions means you need to know when data has been sent to the browser
Separation of concerns and template systems help with this

Database rollouts
Specific database table defines how the data should be processed
Log database deltas
Automated process to roll out changes
Backup existing table first:
DATE=`date +%H-%M-%d-%m-%y`
mysqldump -h HOST -u USER -pPASSWORD DATABASE TABLENAME > /backups/dictionary_$DATE.sql
Roll out changes:
cd /var/www/pdictionarypatcher/repo/
git pull origin master
cd src
php index.php

private function applyNextPatch( $currentPatchID ) {
    $patchToTry = ++$currentPatchID;
    if( file_exists( FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php' ) ) {
        $sql = file_get_contents( FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php' );
        $this->database->multi_query( $sql );
        return $this->applyNextPatch( $patchToTry );
    } else {
        return $patchToTry - 1;
    }
}

The future

Tiered SAN hardware

NoSQL?
MySQL was used as a “golden hammer”
Original team of contractors who built the system knew it
Easy to hire developers who know it
Not necessarily the best option
We had to introduce application-level sharding for it to suit the growing needs

Rationalisation
Do we need all that data? Really?
At the moment: probably
In the future: probably not

Direct queue interaction
Types of message queue could allow our live data to be streamed direct from a queue
We could use this infrastructure to share the data with partners instead of providing them regular processed exports

More hardware
More vehicles + new components = need for more storage

Conclusions
So you need to work with a crap-load of data?

PHP needs lots of friends
PHP is a great tool for:
Displaying the data
Processing the data
Exporting the data
Binding business logic to the data
It needs friends to:
Queue the data
Insert the data
Visualise the data

Continually Review
Your schema & indexes
Your queries
Efficiencies in your code
Number of AJAX requests

Message Queue: A safety net
Queue what you can
Lets you move data around while you process it
Gives your hardware some breathing space

Code Considerations
Template engines
Dependency management
Abstraction
Autoloading
Session handling
Request management

Compile Data
Keep related data together
Look at storing summaries of data
Approach used by analytics companies: granularity changes over time:
This week: per-second data
Last week: hourly summaries
Last month: daily summaries
Last year: monthly summaries
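
A hedged sketch of rolling per-second readings up into the kind of hourly summary described above (the summary table, its columns, the shard name and the mysqli connection are all assumptions):

// Sketch: summarise one weekly shard into hourly rows, then archive or drop the shard
$mysqli->query(
    "INSERT INTO hourly_summaries (vehicle_id, hour_start, avg_value, max_value, sample_count)
     SELECT vehicle_id,
            DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00'),
            AVG(value), MAX(value), COUNT(*)
     FROM datavalue_2013_19
     GROUP BY vehicle_id, DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00')"
);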

Thanks; Q+A
Michael Peacock
mkpeacock@gmail.com
@michaelpeacock
www.michaelpeacock.co.uk

Photo credits
flickr.com/photos/itmpa/4531956496/
flickr.com/photos/eveofdiscovery/3149008295
flickr.com/photos/gadl/89650415/
flickr.com/photos/brapps/403257780