Evolution of a big data project

Speaker notes
  • Hello everyone; thanks for coming. I spent the last 12 months working on a large-scale, data-intensive project: the development of a PHP web application which had to support, display, process, report against and export a phenomenal amount of data each day.
  • The project dealt with telematics data from vehicles produced by Smith Electric Vehicles, one of the world's largest manufacturers of all-electric commercial vehicles. In a new and emerging industry, performance, efficiency and fault-reporting data from these vehicles is very valuable. As you can imagine, the drive and battery systems of an electric vehicle generate a large amount of data, with batteries broken down into smaller cells, each giving us temperature, current, voltage and state-of-charge data.
  • As the data may relate to performance and faults, we need to ensure we receive it; telematics projects which offer safety features treat this as an even more important issue. We also have government partners who subsidise the vehicle cost in exchange for some of this data, so we need to be able to give the data to them as well as receiving it ourselves. Because EVs rely on chemistry and external factors, we also need to keep the data so we can compare it across different times of the year and different locations.
  • What you will realise is that we, in effect, built a large-scale distributed denial-of-service system and pointed it directly at our own hardware, with the caveat that we actually needed the data from the DDoS attack!
  • Before we could do anything else, we needed to be able to process the data and store it within the system. This includes actually transferring the data to our servers, inserting it into our database cluster and performing business logic on the data.
  • In order for us to reliably receive the data, the system needs to be online so that data can be transferred. We also need the server capacity to process the data, and we need to be able to scale the system. Just because there are X data collection units out there, we don't know how many will be on at a given time, and we have to deal with more and more collection units being built and delivered.
  • The biggest problem is dealing with the pressure of that data stream.
  • There are a range of AMQP libraries for PHP, some of them based on the C library and other awkward dependencies. php-amqplib is a pure PHP implementation which is really easy to use and can be installed directly via Composer; because it is pure PHP, it is easy to get up and running on any platform. It supports both publishing and consuming messages from a queue, which is great not only for dealing with streams of data but also for storing events and requests across multiple sessions, or for dispatching jobs.
  • A small buffer allows us to cope with connectivity problems to our message queue, or signal problems with the data collection devices.
  • To give data import the resources it needs, the system had dedicated hardware to consume messages from the message queue, perform business logic and convert them to MySQL inserts. Although it's an obvious one, it's also easily overlooked: the data is bundled together into LOAD DATA INFILE statements for MySQL.
  • With a project of this scale, dealing with business-critical data could lead to deployment anxiety: a bug in rolled-out code could cause problems displaying real-time data, or cause exported data or processed reports to be incorrect, requiring them to be re-run at a cost of CPU time, most of which was already in use generating that day's reports or dealing with that day's data imports. The architecture of the application also placed constraints on maintenance.
  • Transcript

    • 1. Evolution of a “big data” project Michael Peacock
    • 2. @michaelpeacock
    • 3. @michaelpeacock Head Developer @groundsix
    • 4. @michaelpeacock Head Developer @groundsix Author of a number of web related books
    • 5. @michaelpeacock Head Developer @groundsix Author of a number of web related books Occasional conference / user group speaker
    • 6. Ground Six
    • 7. Ground Six Tech company based in the North East of England
    • 8. Ground Six Tech company based in the North East of England Specialise in developing web and mobile applications
    • 9. Ground Six Tech company based in the North East of England Specialise in developing web and mobile applications Provide investment (financial and tech) to interesting app ideas
    • 10. Ground Six Tech company based in the North East of England Specialise in developing web and mobile applications Provide investment (financial and tech) to interesting app ideas Got an idea? Looking for investment? www.groundsix.com
    • 11. What's in store
    • 12. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day
    • 13. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data
    • 14. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data; Querying the data quickly
    • 15. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data; Querying the data quickly; Reporting against the data
    • 16. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data; Querying the data quickly; Reporting against the data; Keeping the application responsive
    • 17. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data; Querying the data quickly; Reporting against the data; Keeping the application responsive; Keeping the application running
    • 18. What's in store Challenges, solutions and approaches when dealing with billions of inserts per day: Processing and storing the data; Querying the data quickly; Reporting against the data; Keeping the application responsive; Keeping the application running; Legacy project, problems and code
    • 19. Vehicle Telematics
    • 20. Electric Vehicles: Need for Data
    • 21. Electric Vehicles: Need for Data We need to receive all of the data
    • 22. Electric Vehicles: Need for Data We need to receive all of the data We need to keep all of the data
    • 23. Electric Vehicles: Need for Data We need to receive all of the data We need to keep all of the data We need to be able to display data in real time
    • 24. Electric Vehicles: Need for Data We need to receive all of the data We need to keep all of the data We need to be able to display data in real time We need to transfer large chunks of data to customers and government departments
    • 25. Electric Vehicles: Need for Data We need to receive all of the data We need to keep all of the data We need to be able to display data in real time We need to transfer large chunks of data to customers and government departments We need to be able to calculate performance metrics from the data
    • 26. Some stats
    • 27. Some stats 500 (approx) telemetry-enabled vehicles using the system
    • 28. Some stats 500 (approx) telemetry-enabled vehicles using the system 2500 data points captured per vehicle per second
    • 29. Some stats 500 (approx) telemetry-enabled vehicles using the system 2500 data points captured per vehicle per second > 1.5 billion MySQL inserts per day
    • 30. Some stats 500 (approx) telemetry-enabled vehicles using the system 2500 data points captured per vehicle per second > 1.5 billion MySQL inserts per day World's largest vehicle telematics project outside of Formula 1
    • 31. More stats Constant minimum of 4000 inserts per second within the application Peaks: 3 million inserts per second
    • 32. Processing and storing the data
    • 33. Receiving continuous data streams
    • 34. Receiving continuous data streams We need to be online
    • 35. Receiving continuous data streams We need to be online We need to have capacity to process the data
    • 36. Receiving continuous data streams We need to be online We need to have capacity to process the data We need to scale
    • 37. Message Queue Fast, secure, reliable and scalable Hosted: they worry about the server infrastructure and availability We only have to process what we can
    • 38. AMQP + PHP php-amqplib (github.com/videlalvaro/php-amqplib), or install it via Composer: videlalvaro/php-amqplib Pure PHP implementation Handles publishing and consuming messages from a queue
    • 39. AMQP: Consume
      // connect to the AMQP server
      $connection = new AMQPConnection($host, $port, $user, $password);
      // create a channel; a logical, stateful link to our physical connection
      $channel = $connection->channel();
      // declare the exchange (where messages are sent)
      $channel->exchange_declare($exchange, 'direct');
      // bind the queue to the exchange
      $channel->queue_bind($queue, $exchange);
      // consume by sending each message to our processing callback function
      $channel->basic_consume($queue, $consumerTag, false, false, false, false, $callbackFunctionName);
      while (count($channel->callbacks)) {
          $channel->wait();
      }
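      The publish side is not shown on the slides; a minimal sketch, assuming the same php-amqplib setup ($host, $port, $user, $password, $exchange, $queue) as the consume example above, and a hypothetical $telemetryReading payload:
      // declare the exchange and queue, then publish a message to them
      $connection = new AMQPConnection($host, $port, $user, $password);
      $channel = $connection->channel();
      $channel->exchange_declare($exchange, 'direct');
      $channel->queue_declare($queue);
      $channel->queue_bind($queue, $exchange);
      // wrap the (hypothetical) telemetry payload in a message object and send it
      $message = new AMQPMessage(json_encode($telemetryReading));
      $channel->basic_publish($message, $exchange);
      $channel->close();
      $connection->close();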
    • 40. Buffers
    • 41. Pulling in the data Dedicated application and hardware to consume from the Message Queue and convert to MySQL inserts MySQL: LOAD DATA INFILE Very fast Due to high volumes of data, these “bulk operations” only cover a few seconds of time - still giving a live stream of data
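      A minimal sketch of that consume-and-bulk-load step, assuming a hypothetical buffer file, column list and connection details; the real consumer applied business logic before writing each row, and targeted the sharded tables described later:
      // buffer a few seconds of decoded readings into a CSV file
      $bufferFile = '/tmp/datavalue_buffer.csv';   // assumed path
      $fh = fopen($bufferFile, 'w');
      foreach ($bufferedReadings as $reading) {
          // $reading = array(unit_id, variable_id, timestamp, value) - assumed columns
          fputcsv($fh, $reading);
      }
      fclose($fh);
      // hand the whole file to MySQL in a single LOAD DATA INFILE statement
      $db = new mysqli($dbHost, $dbUser, $dbPassword, $dbName);
      $db->query("LOAD DATA INFILE '" . $db->real_escape_string($bufferFile) . "'
                  INTO TABLE datavalue
                  FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
                  (unit_id, variable_id, timestamp, value)");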
    • 42. Optimising MySQL innodb_flush_method=O_DIRECT Lets the buffer pool bypass the OS cache InnoDB buffer pools are more efficient than the OS cache Can have negative side effects Improve write performance: innodb_flush_log_at_trx_commit=2 Prevents per-commit log flushing Query cache size (query_cache_size) Measure your application's usage and make a judgement Our data stream was too frequent to make use of the cache
    • 43. Sharding (1) Evaluate data, look for natural break points Split the data so each data collection unit (vehicle) had a separate database Gives some support for horizontal scaling Provided the data per vehicle is a reasonable size
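      A sketch of what routing queries to a per-vehicle database might look like; the naming scheme and helper below are hypothetical illustrations rather than the project's actual code:
      // hypothetical helper: pick the shard database for a data collection unit
      function getDatabaseNameForVehicle($vehicleID)
      {
          // one database per vehicle, e.g. "vehicle_data_142"
          return 'vehicle_data_' . (int) $vehicleID;
      }
      $db = new mysqli($dbHost, $dbUser, $dbPassword);
      $db->select_db(getDatabaseNameForVehicle($vehicleID));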
    • 44. System architecture
    • 45. But the MQ can store data... why do you have a problem? Message Queue isn’t designed for storage Messages are transferred in a compressed form Nature of vehicle data (CAN) means that a 16-character string is actually 4 - 64 pieces of data
    • 46. Sam Lambert Solves big-data MySQL problems for breakfast Constantly tweaking the servers and configuration to get more and more performance Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before www.samlambert.com http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html Twitter: @isamlambert
    • 47. Querying the data QUICKLY!
    • 48. Graphs! Slow!
    • 49. Long Running Queries More and more vehicles came into service Huge amount of data resulted in very slow queries Page load Session locking Slow exports Slow backups
    • 50. Real time information Original database schema dictated all information was accessed via a query, or a separate subquery. Expensive. Live information: Up to 30 data points Refreshing every 5 - 30 seconds via AJAX Painful
    • 51. Requests Asynchronous requests let the page load before the data Number of these requests had to be monitored Real time information used Fusion Charts 1 AJAX call per chart 10 - 30 charts per vehicle live screen Refresh every 5 - 30 seconds
    • 52. Requests: Optimised
    • 53. Single entry point Multiple entry points make it difficult to dynamically change the timeout and memory usage of key pages, as well as dealing with session locking issues effectively. Single point of entry is essential Check out the Symfony routing component...
    • 54. Symfony Routing
      // load your routes
      $locator = new FileLocator( array( __DIR__ . '/../../' ) );
      $loader = new YamlFileLoader( $locator );
      $loader->load( 'routes.yml' );
      $request = ( isset( $_SERVER['REQUEST_URI'] ) ) ? $_SERVER['REQUEST_URI'] : '';
      $requestContext = new RequestContext( $request );
      // set up the router (RoutingRouter: assumed alias for Symfony\Component\Routing\Router)
      $router = new RoutingRouter( new YamlFileLoader( $locator ), 'routes.yml', array( 'cache_dir' => null ), $requestContext );
      $requestURL = ( isset( $_SERVER['REQUEST_URI'] ) ) ? $_SERVER['REQUEST_URI'] : '';
      $requestURL = ( strlen( $requestURL ) > 1 ) ? rtrim( $requestURL, '/' ) : $requestURL;
      // get the route for your request
      $route = $router->match( $requestURL );
      // act on the route
    • 55. Sharding: split the data into smaller buckets
    • 56. Sharding (2) Data is very time relevant Only care about specific days Don’t care about comparing data too much Split the data so that each week had a separate table
    • 57. Supporting Sharding Simple PHP function to run all queries through. Works out the table name. Link with a sprintf to get the full query string
      /**
       * Get the sharded table to use from a specific date
       * @param String $date YYYY-MM-DD
       * @return String
       */
      public function getTableNameFromDate( $date )
      {
          // ASSUMPTION: today's database is ALWAYS THERE
          // ASSUMPTION: you shouldn't be querying for data in the future
          $date = ( $date > date( 'Y-m-d' ) ) ? date( 'Y-m-d' ) : $date;
          $stt = strtotime( $date );
          if( $date >= $this->switchOver ) {
              $year = ( date( 'm', $stt ) == '01' && date( 'W', $stt ) == 52 ) ? date( 'Y', $stt ) - 1 : date( 'Y', $stt );
              return 'datavalue_' . $year . '_' . date( 'W', $stt );
          } else {
              return 'datavalue';
          }
      }
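      A usage sketch of linking a query to that helper with sprintf, as the slide describes; the query, column names and $this->database call are illustrative:
      // build the full query string against the correct weekly shard
      $table = $this->getTableNameFromDate( '2013-05-14' );
      $sql = sprintf(
          "SELECT timestamp, value FROM %s WHERE variable_id = %d AND timestamp BETWEEN %d AND %d",
          $table,
          $variableID,
          $dayStart,
          $dayEnd
      );
      $result = $this->database->query( $sql );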
    • 58. Sharding: an excuse Alterations to the database schema Code to support smaller buckets of data Take advantage of needing to touch queries and code: improve them!
    • 59. Index Optimisation Two sharding projects left the schema as a Frankenstein Indexes still had data from before the first shard (the vehicle ID) Wasting storage space Increasing the index size Increasing query time Makes the index harder to fit into memory
    • 60. Schema Optimisation MySQL provides a range of data types Varying storage implications Does that need to be a BIGINT? Do you really need DOUBLE PRECISION when a FLOAT will do? Are those tables, fields or databases still required? Perform regular schema audits
    • 61. Query Optimisation Run your queries through EXPLAIN EXTENDED Check they hit the indexes For big queries avoid functions such as CURDATE - this helps ensure the cache is hit
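      To illustrate the CURDATE point: the two statements below return the same rows, but MySQL never serves queries containing non-deterministic functions such as CURDATE() from the query cache, whereas the second version is an identical literal string on every run that day, so it can be cached (table and column names are illustrative):
      // not cacheable: contains a non-deterministic function
      $sql = "SELECT AVG(value) FROM datavalue WHERE reading_date = CURDATE()";
      // cacheable: the literal date makes the query string identical between runs
      $today = date( 'Y-m-d' );
      $sql = sprintf( "SELECT AVG(value) FROM datavalue WHERE reading_date = '%s'", $today );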
    • 62. Reporting against the data
    • 63. Performance report
    • 64. Reports & Intensive Queries How far did the vehicle travel today? Calculation involves looking at every single motor speed value for the day How much energy did the vehicle use today? Calculation involves looking at multiple variables for every second of the day Lookup time + calculation time
    • 65. Group the queries Leverage indexes Perform related queries in succession Then perform calculations Catching up on a backlog of calculations and exports? Do a table of queries at a time Make use of indexes
    • 66. Save the report Automate the queries in dead time, grouped together nicely Save the results in a reports table Only a single record per vehicle per day of performance data Means users and management can run aggregate and comparison queries themselves quickly and easily
    • 67. Enables date-range aggregation
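      For example, with one summary row per vehicle per day, a month of comparisons becomes a cheap query over the reports table; the table and column names below are illustrative, not the project's actual schema:
      // illustrative date-range aggregation against a daily summary table
      $sql = "SELECT vehicle_id,
                     SUM(distance_travelled) AS total_distance,
                     SUM(energy_used) AS total_energy
              FROM daily_performance_report
              WHERE report_date BETWEEN '2013-04-01' AND '2013-04-30'
              GROUP BY vehicle_id";
      $result = $db->query( $sql );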
    • 68. Check for efficiency savings Initial export scripts maintained a MySQLi connection per database (500!) Updated to maintain one per server and simply switch to the database in question
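      A minimal sketch of that change: mysqli's select_db() switches the default database on an existing connection instead of opening a new one (the per-vehicle database naming is an assumption, as above):
      // one connection per database server, reused for every vehicle hosted on it
      $connection = new mysqli( $serverHost, $dbUser, $dbPassword );
      foreach( $vehiclesOnThisServer as $vehicleID ) {
          // switch the existing connection to this vehicle's database
          $connection->select_db( 'vehicle_data_' . (int) $vehicleID );
          // ... run the export queries for this vehicle ...
      }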
    • 69. Leverage your RAM Intensive queries might only use X% of your RAM Safe to run more than one report / export at a time Add support for multiple exports / reports within your scripts e.g.
    • 70.
      $numberOfConcurrentReportsToRun = 2;
      $reportInstance = 0;
      $counter = 0;
      $dataToProcess = array();
      foreach( $data as $unit ) {
          if( ( $counter % $numberOfConcurrentReportsToRun ) == $reportInstance ) {
              $dataToProcess[] = $unit;
          }
          $counter++;
      }
    • 71. Extrapolate & Assume Data is only stored when it changes Known assumptions are used to extrapolate values for all seconds of the day Saves MySQL but costs in RAM “Interlation”
    • 72. Interlation
      * Add an array to the interlation
      public function addArray( $name, $array )
      * Get the time that we first receive data in one of our arrays
      public function getFirst( $field )
      * Get the time that we last received data in any of our arrays
      public function getLast( $field )
      * Generate the interlaced array
      public function generate( $keyField, $valueField )
      * Break the interlaced array down into separate days
      public function dayBreak( $interlationArray )
      * Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_
      public function generateAndFill( $keyField, $valueField )
      * Populate the new combined array with key fields using the common field
      public function populateKeysFromField( $field, $valueField=null )
      http://www.michaelpeacock.co.uk/interlation-library
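      A rough usage sketch pieced together from the method list above; the construction, field names and exact behaviour are assumptions rather than the library's documented usage:
      $interlation = new Interlation();
      // add the sparse, change-only data streams pulled from MySQL
      $interlation->addArray( 'motorSpeed', $motorSpeedRows );
      $interlation->addArray( 'batteryCurrent', $batteryCurrentRows );
      // build one combined array keyed by timestamp, filled for every second
      // between the first and last readings
      $combined = $interlation->generateAndFill( 'timestamp', 'value' );
      // split the combined array into per-day chunks for the daily reports
      $days = $interlation->dayBreak( $combined );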
    • 73. Food for thought Gearman Tool to schedule and run background jobs
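      A minimal sketch of what Gearman enables from PHP via the pecl gearman extension; the job name and payload are hypothetical:
      // worker: registers a function and processes jobs as they arrive
      $worker = new GearmanWorker();
      $worker->addServer();   // defaults to 127.0.0.1:4730
      $worker->addFunction( 'generate_report', function( GearmanJob $job ) {
          $vehicleID = $job->workload();
          // ... run the daily performance report for this vehicle ...
      } );
      while( $worker->work() );
      // client: queues a background job without waiting for the result
      $client = new GearmanClient();
      $client->addServer();
      $client->doBackground( 'generate_report', $vehicleID );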
    • 74. Keeping the application responsive
    • 75. Session Locking Some queries were still (understandably, and acceptably) slow Sessions would lock and AJAX scripts would enter race conditions User would attempt to navigate to another page: their session with the web server wouldn’t respond
    • 76. Session Locking: Resolution Session locking caused by how PHP handles sessions; the session file is not closed until the request has finished executing Potential solution: use another method e.g. database Our solution: manually close the session
    • 77. Closing the session session_write_close(); Caveats: If you need to write to the session again in the execution cycle, you must call session_start() again Made problematic by the lack of template handling
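      A sketch of the pattern this implies, using only standard PHP session functions; the slow query and session keys are placeholders:
      session_start();
      $userID = $_SESSION['user_id'];          // read what the request needs
      session_write_close();                   // release the lock so parallel AJAX calls are not blocked
      $data = runSlowVehicleQuery( $userID );  // hypothetical long-running work
      // only if the session needs writing to again later in the request:
      session_start();
      $_SESSION['last_report_run'] = time();
      session_write_close();
      The second session_start() only works cleanly if no output has been sent yet, which is why the lack of template handling mentioned on the slide made this awkward.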
    • 78. Live real-time data Request consolidation helped Each data point on the live screen was still a separate query due to original design constraints Live fleet information spanned multiple databases e.g. a map of all vehicles belonging to a customer Solution: caching
    • 79. Caching with memcached Fast, in-memory key-value store Used to keep a copy of the most recent data from each vehicle
      $mc = new Memcache();
      $mc->connect($memcacheServer, $memcachePort);
      $realTimeData = $mc->get($vehicleID . '-' . $dataVariable);
      Failover: Moxi Memcached Proxy
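      The write side is not shown; presumably the import worker refreshes the cache as each reading arrives, along these lines (key format taken from the read example above, the TTL is an assumption):
      // keep the most recent value for this vehicle/variable pair in memcached
      $mc = new Memcache();
      $mc->connect($memcacheServer, $memcachePort);
      // flags = 0, expire after 60 seconds so stale vehicles drop out (assumed TTL)
      $mc->set($vehicleID . '-' . $dataVariable, $value, 0, 60);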
    • 80. Caching enables large range of data to be looked up quickly
    • 81. Legacy Project Constraints, problems and code. Easing deployment anxiety.
    • 82. Source Control Management Initially SVN Migrated to git Branch-per-feature strategy Automated deployment
    • 83. Dependencies Dependency Injection framework missing from the application, caused problems with: Authentication Memcache Handling multiple concurrent database connections Access control
    • 84. Autoloading PSR-0
    • 85. Templates and sessions Closing and opening sessions means you need to know when data has been sent to the browser Separation of concerns and template systems help with this
    • 86. Database rollouts Specific database table defines how the data should be processed Log database deltas Automated process to roll out changes Backup existing table first:
      DATE=`date +%H-%M-%d-%m-%y`
      mysqldump -h HOST -u USER -pPASSWORD DATABASE TABLENAME > /backups/dictionary_$DATE.sql
      cd /var/www/pdictionarypatcher/repo/
      git pull origin master
      cd src
      php index.php
      Roll out changes
    • 87.
      private function applyNextPatch( $currentPatchID )
      {
          $patchToTry = ++$currentPatchID;
          if( file_exists( FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php' ) ) {
              $sql = file_get_contents( FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php' );
              $this->database->multi_query( $sql );
              return $this->applyNextPatch( $patchToTry );
          } else {
              return $patchToTry - 1;
          }
      }
    • 88. The future
    • 89. Tiered SAN hardware
    • 90. NoSQL? MySQL was used as a “golden hammer” Original team of contractors who built the system knew it Easy to hire developers who know it Not necessarily the best option We had to introduce application-level sharding for it to suit the growing needs
    • 91. Rationalisation Do we need all that data? Really? At the moment: probably In the future: probably not
    • 92. Direct queue interaction Types of message queue could allow our live data to be streamed direct from a queue We could use this infrastructure to share the data with partners instead of providing them regular processed exports
    • 93. More hardware More vehicles + new components = need for more storage
    • 94. Conclusions So you need to work with a crap-load of data?
    • 95. PHP needs lots of friends PHP is a great tool for: Displaying the data Processing the data Exporting the data Binding business logic to the data It needs friends to: Queue the data Insert the data Visualise the data
    • 96. Continually Review Your schema & indexes Your queries Efficiencies in your code Number of AJAX requests
    • 97. Message Queue: A safety net Queue what you can Lets you move data around while you process it Gives your hardware some breathing space
    • 98. Code Considerations Template engines Dependency management Abstraction Autoloading Session handling Request management
    • 99. Compile Data Keep related data together Look at storing summaries of data Approach used by analytics companies: granularity changes over time: This week: per-second data Last week: Hourly summaries Last month: Daily summaries Last year: Monthly summaries
    • 100. Thanks; Q+A Michael Peacock mkpeacock@gmail.com @michaelpeacock www.michaelpeacock.co.uk
    • 101. Photo credits flickr.com/photos/itmpa/4531956496/ flickr.com/photos/eveofdiscovery/3149008295 flickr.com/photos/gadl/89650415/ flickr.com/photos/brapps/403257780
