
Using Apache Spark and MySQL for Data Analysis

Slides for FOSDEM 2017. Author: Alexander Rubin. Speaker: Sveta Smirnova.


  1. Using Apache Spark and MySQL for Data Analysis. Alexander Rubin, Sveta Smirnova, Percona. February 4, 2017
  2. Agenda
     • Why Spark?
     • Spark examples
       – Wikistats analysis with Spark
  3. What is Spark anyway? [Architecture diagram: an SQL app talks over the data/SQL protocol to parallel, compute-only Spark nodes backed by a local FS]
  4. Why Spark?
     • In-memory processing with caching
     • Massively parallel
     • Direct access to data sources (i.e., MySQL):
       >>> df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm")
     • Can store data in Hadoop HDFS / S3 / the local filesystem
     • Native Python and R integration
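A slightly fuller sketch of that JDBC read (not from the slides), in the Spark 1.x style the deck uses; it assumes a local MySQL with the ontime.ontime_sm table and the MySQL Connector/J jar on Spark's classpath:

    # Minimal PySpark sketch, Spark 1.x API as in the slides.
    # Assumes: MySQL on localhost, table ontime.ontime_sm, and the JDBC
    # driver available (e.g. spark-submit --jars mysql-connector-java.jar).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="mysql-jdbc-read")
    sqlContext = SQLContext(sc)

    # Spark hands the connection string to the JDBC driver and pulls the
    # table back as a DataFrame it can then process in parallel.
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost?user=root",
        dbtable="ontime.ontime_sm",
    ).load()

    df.printSchema()
    print(df.count())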
  5. Spark vs MySQL
  6. Spark vs. MySQL for big data
     • MySQL: indexes, partitioning, "sharding"
     • Spark: full table scan, partitioning, map/reduce
  7. Spark (vs. MySQL)
     • No indexes
     • All processing is full scan
     • BUT: distributed and parallel
     • No transactions
     • High latency (usually)
     MySQL: 1 query = 1 CPU core
  8. Indexes (BTree) and the big data challenge
     • Creating an index for petabytes of data?
     • Updating an index for petabytes of data?
     • Reading a terabyte-sized index?
     • Random reads of a petabyte?
     Full scan in parallel is better for big data
  9. ETL / Pipeline
     MySQL (ETL):
     1. Extract data from the external source
     2. Transform it before loading
     3. Load the data into MySQL
     Spark:
     1. Extract data from the external source
     2. Load the data (or rsync it) to all Spark nodes
     3. Transform/analyze/visualize the data, in parallel
  10. Schema on Write vs. Schema on Read
      Schema on Write (MySQL):
      • LOAD DATA INFILE will verify (validate) the input
      • ... with implicit data conversion
      • ... or fail if the number of columns is wrong
      Schema on Read (Spark):
      • No "load data" per se, so nothing to validate at load time
      • ... create an external table or read the CSV
      • ... validation happens on "read"/SELECT
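A minimal schema-on-read illustration (not from the slides), using the Spark 2+ SparkSession API and assuming wikistats-style space-separated files under /data/wikistats/; nothing is checked until the data is actually read:

    # Schema-on-read sketch. Assumed layout per line:
    # project title num_requests content_size (space-separated).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    schema = StructType([
        StructField("project", StringType()),
        StructField("title", StringType()),
        StructField("num_requests", LongType()),
        StructField("content_size", LongType()),
    ])

    # No validation happens here; Spark only records where the files are.
    df = spark.read.csv("/data/wikistats/", sep=" ", schema=schema)

    # Type coercion and column-count checks happen now, at read time;
    # rows that do not fit the schema come back with NULL columns.
    df.filter("num_requests IS NULL").count()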
  11. Example: Loading wikistat into MySQL
      1. Extract data from the external source and uncompress it!
      2. Load data into MySQL and transform it
      Wikipedia page counts: download, >10TB
      load data local infile '$file'
      into table wikistats.wikistats_full CHARACTER SET latin1
      FIELDS TERMINATED BY ' '
      (project_name, title, num_requests, content_size)
      set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
          title_md5 = unhex(md5(title));
      http://dumps.wikimedia.org/other/pagecounts-raw/
  12. Load timing per hour of wikistat
      • InnoDB: 52.34 sec
      • MyISAM: 11.08 sec (+ indexes)
      • 1 hour of wikistats loads in ~1 minute
      • 1 year (8,765.81 hours) will load in ~6 days
      • 6 years => more than 1 month to load
      Not even counting the insert-time degradation...
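The arithmetic behind those estimates, as a quick check (plain Python, numbers taken from the slide):

    # Projection from the "1 hour of data = 1 minute to load" rate above.
    seconds_per_hour_of_data = 60
    hours_per_year = 8765.81
    days_for_one_year = hours_per_year * seconds_per_hour_of_data / 86400.0
    print(round(days_for_one_year, 1))    # ~6.1 days for one year
    print(round(days_for_one_year * 6))   # ~37 days for six years: > 1 month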
  13. Loading wikistat as-is into Spark
      • Just copy the files to storage (AWS S3 / local / etc.)...
        – and create the SQL structure
      • Or read the CSV, aggregate/filter in Spark and
        – load the aggregated data into MySQL
  14. Loading wikistat as-is into Spark
      • How fast is the search?
        – Depends upon the number of nodes
      • 1000-node Spark cluster:
        – 4.5 TB, 104 billion records
        – Exec time: 45 sec
        – Scanning 4.5 TB of data
      • http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
  15. Pipelines: MySQL vs Spark
  16. Spark and WikiStats: load pipeline (excerpt)
      Row(project=p[0], url=urllib.unquote(p[1]).lower(),
          num_requests=int(p[2]), content_size=int(p[3]))
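Only the Row construction survived in the transcript; a hedged reconstruction of the surrounding pipeline (Python 2 / Spark 1.x, matching the rest of the deck; the input path and the length guard are assumptions):

    # Load pipeline sketch: raw pagecount lines -> parsed Rows -> SQL table.
    import urllib
    from pyspark.sql import Row

    # Each line looks like: project title num_requests content_size
    lines = sc.textFile("/data/wikistats/pagecounts-*").map(
        lambda line: line.split(' '))

    wikistats = lines.filter(lambda p: len(p) == 4).map(lambda p: Row(
        project=p[0],
        url=urllib.unquote(p[1]).lower(),   # decode %xx escapes in titles
        num_requests=int(p[2]),
        content_size=int(p[3])))

    # Register the Rows as a table so slide 17's SQL can run against it.
    df = sqlContext.createDataFrame(wikistats)
    df.registerTempTable("wikistats")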
  17. Save results to MySQL
      group_res = sqlContext.sql(
          "SELECT '" + mydate + "' as mydate, url, count(*) as cnt, "
          "sum(num_requests) as tot_visits "
          "FROM wikistats GROUP BY url")
      # Save to MySQL
      mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
      group_res.write.jdbc(url=mysql_url,
          table="wikistats.wikistats_by_day_spark", mode="append")
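Each DataFrame partition writes over its own JDBC connection, which is what produces the multi-threaded inserts shown on the next slide; a hedged variant that makes the parallelism explicit (the partition count of 4 is an arbitrary assumption):

    # The number of partitions controls how many concurrent INSERT
    # streams hit MySQL during the JDBC write.
    group_res.repartition(4).write.jdbc(
        url=mysql_url,
        table="wikistats.wikistats_by_day_spark",
        mode="append")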
  18. Multi-Threaded Inserts
  19. PySpark: CPU
      Cpu0  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu1  :  5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
      Cpu2  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu3  : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu4  :  0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu5  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu6  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu7  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu8  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      ...
      Cpu17 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu18 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu19 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu20 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu21 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu22 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Cpu23 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
      Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
  20. Monitoring your jobs
  21. (image-only slide)
  22. Search the WikiStats in MySQL: 10 most frequently queried wiki pages in January 2008
      mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
          -> FROM wikistats_by_day_spark
          -> where lower(url) not like '%special%'
          ->   and lower(url) not like '%page%'
          ->   and lower(url) not like '%test%'
          ->   and lower(url) not like '%wiki%'
          -> group by lower(url)
          -> order by max_visits desc limit 10;
      +--------------------------------------------------------+------------+----------+
      | lurl                                                   | max_visits | count(*) |
      +--------------------------------------------------------+------------+----------+
      | heath_ledger                                           |    4247338 |      131 |
      | cloverfield                                            |    3846404 |      131 |
      | barack_obama                                           |    2238406 |      153 |
      | 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
      | the_dark_knight_(film)                                 |    1417186 |       64 |
      | martin_luther_king,_jr.                                |    1394934 |      136 |
      | deaths_in_2008                                         |    1372510 |       67 |
      | united_states                                          |    1357253 |      167 |
      | scientology                                            |    1349654 |      108 |
      | portal:current_events                                  |    1261538 |      125 |
      +--------------------------------------------------------+------------+----------+
      10 rows in set (1 hour 22 min 10.02 sec)
  23. Search the WikiStats in SparkSQL: 10 most frequently queried wiki pages in January 2008
      spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
               > USING org.apache.spark.sql.parquet
               > OPTIONS (path "/ssd/wikistats_parquet_bydate");
      Time taken: 3.466 seconds
      spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
               > FROM wikistats_parquet
               > where lower(url) not like '%special%'
               >   and lower(url) not like '%page%'
               >   and lower(url) not like '%test%'
               >   and lower(url) not like '%wiki%'
               > group by lower(url)
               > order by max_visits desc limit 10;
      heath_ledger                                             4247335   42
      cloverfield                                              3846400   42
      barack_obama                                             2238402   53
      1925_in_baseball#negro_league_baseball_final_standings   1791341   11
      the_dark_knight_(film)                                   1417183   36
      martin_luther_king,_jr.                                  1394934   46
      deaths_in_2008                                           1372510   38
      united_states                                            1357251   55
      scientology                                              1349650   44
      portal:current_events                                    1261305   44
      Time taken: 1239.014 seconds (~20 min), Fetched 10 row(s)
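The spark-sql session assumes Parquet files already exist under /ssd/wikistats_parquet_bydate; a hedged sketch of how the DataFrame from slide 16 could be written out that way (the mydate partition column is an assumption, inferred from the "bydate" directory name):

    # Write the wikistats DataFrame as Parquet, one subdirectory per day,
    # producing the layout the CREATE TEMPORARY TABLE above reads back.
    df.write.partitionBy("mydate").parquet("/ssd/wikistats_parquet_bydate")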
  24. Apache Drill
      • Treat any data source as a table (even if it is not)
      • Querying MongoDB with SQL
  25. Magic? !=
  26. Recap...
      MySQL:
      1. Searches the full dataset
         • may be pre-filtered
         • not aggregated
      2. No parallelism
      3. Based on an index?
      4. InnoDB <> columnar
      5. Partitioning?
      Spark:
      1. Dataset is already
         – filtered (only site="en")
         – aggregated (group by url)
      2. Parallelism (+)
      3. Not based on an index
      4. Columnar (+)
      5. Partitioning (+)
  27. Thank you! Alexander Rubin, https://www.linkedin.com/in/alexanderrubin
