Parallel Distributed Image Stacking and Mosaicing with Hadoop (Hadoop Summit 2010)

Hadoop Summit - Research Track
Parallel Distributed Image Stacking and Mosaicing with Hadoop
Keith Wiley, University of Washington

Transcript

  • 1. Astronomical Image Processing with Hadoop Keith Wiley* Survey Science Group Dept. of Astronomy, Univ. of Washington *Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Gardner, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus 1
  • 2. Session Agenda  Astronomical Survey Science  Image Coaddition  Implementing Coaddition within MapReduce  Optimizing the Coaddition Process  Conclusions  Future Work 2
  • 3. Astronomical Topics of Study  Dark energy  Large scale structure of universe  Gravitational lensing  Asteroid detection/tracking 3
  • 4. What is Astronomical Survey Science?  Dedicated sky surveys, usually from a single calibrated telescope/camera pair.  Run for years at a time.  Gather millions of images, requiring TBs of storage*.  Require high-throughput data reduction pipelines.  Require sophisticated off-line data analysis tools. * Next-generation surveys will gather PBs of image data. 4
  • 5. Sky Surveys: Today and Tomorrow  SDSS* (1999-2005): › Founded in part by UW › 1/4 of the sky › 80TB total data  LSST† (2015-2025): › 8.4m mirror, 3.2-gigapixel camera › Half the sky every three nights › 30TB per night... ...one SDSS every three nights › 60PB total (nonstop ten years) › 1000s of exposures of each location [Photo annotation: that's a person! (for scale)] * Sloan Digital Sky Survey † Large Synoptic Survey Telescope 5
  • 6. FITS (Flexible Image Transport System)  Common astronomical image representation file format  Metadata tags (like EXIF): › Most importantly: Precise astrometry* › Other: • Geolocation (telescope location) • Sky conditions, image quality, etc.  Bottom line: › An image format that knows where it is looking. * Position on sky 6
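Since everything downstream keys off this metadata, it may help to see how little code is needed to pull the astrometry out of a FITS header. Below is a minimal sketch assuming the open-source nom.tam.fits Java library and conventional WCS keywords (CRVAL1/CRVAL2 for the reference sky position, FILTER for the band); actual keyword names vary by survey, so treat this as illustrative rather than the project's own reader.

```java
import nom.tam.fits.BasicHDU;
import nom.tam.fits.Fits;
import nom.tam.fits.Header;

public class FitsPeek {
    public static void main(String[] args) throws Exception {
        // Open a FITS file and read the primary header (HDU 0).
        Fits fits = new Fits(args[0]);
        BasicHDU hdu = fits.getHDU(0);
        Header hdr = hdu.getHeader();

        // WCS reference coordinates: where on the sky this image is looking.
        double ra  = hdr.getDoubleValue("CRVAL1");    // right ascension (deg) -- assumed keyword
        double dec = hdr.getDoubleValue("CRVAL2");    // declination (deg) -- assumed keyword
        String filter = hdr.getStringValue("FILTER"); // color band, if present -- assumed keyword

        System.out.printf("RA=%.5f Dec=%.5f filter=%s%n", ra, dec, filter);
    }
}
```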
  • 7. Image Coaddition  Given multiple partially overlapping images and a query (color and sky bounds): › Find images’ intersections with the query bounds. › Project bitmaps to the bounds. › Stack and mosaic into a final product. 7
  • 8. Image Coaddition  Given multiple partially overlapping images and a query (color and sky bounds): › Find images’ intersections with the query bounds. › Project bitmaps to the bounds. › Stack and mosaic into a final product. [Same diagram as the previous slide, annotated with the relative cost of each step: expensive vs. cheap.] 8
  • 9. Image Stacking (Signal Averaging)  Stacking improves SNR: › Makes fainter objects visible.  Example (SDSS, Stripe 82): › Top: Single image, R-band › Bottom: 79-deep stack • (~9x SNR improvement) • Numerous additional detections  Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum. 9
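In its simplest, unweighted form the stack is just a per-pixel mean over co-registered images; averaging N exposures cuts the noise by roughly √N, consistent with the ~9x SNR gain quoted above for a 79-deep stack (√79 ≈ 8.9). A minimal sketch follows, assuming the images have already been projected onto a common pixel grid; the projection and PSF-matching complexity the slide alludes to is omitted.

```java
/** Naive per-pixel mean stack of pre-projected, equally weighted images. */
public class SimpleStacker {
    public static double[][] stack(double[][][] images) {
        int h = images[0].length, w = images[0][0].length;
        double[][] sum = new double[h][w];
        int[][] count = new int[h][w];
        for (double[][] img : images) {
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    double v = img[y][x];
                    if (!Double.isNaN(v)) {   // NaN marks pixels outside an image's footprint
                        sum[y][x] += v;
                        count[y][x]++;
                    }
                }
            }
        }
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                sum[y][x] = count[y][x] > 0 ? sum[y][x] / count[y][x] : Double.NaN;
        return sum;
    }
}
```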
  • 10. Coaddition in Hadoop [Diagram: HDFS supplies input FITS images to many parallel Mappers; each mapper detects the intersection with the query bounds and projects its bitmap to the query’s coordinate system; the projected intersections flow to a single serial Reducer, which stacks and mosaics them into the final coadd.] 10
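A skeletal sketch of how the mapper/reducer split in this diagram could look with the Hadoop Java API (org.apache.hadoop.mapreduce): each mapper receives one FITS record, tests it against the query bounds, and emits a projected intersection under a single key so that one reducer sees them all. The class names and the projectAndClip/accumulate helpers are placeholders, not the authors' actual code.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper: one FITS image in, zero or one projected intersection out. */
class CoaddMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text filename, BytesWritable fitsBytes, Context ctx)
            throws IOException, InterruptedException {
        // Hypothetical helper: parse the FITS header, test overlap with the query
        // bounds (taken from the job configuration), and reproject the overlapping
        // portion of the bitmap to the query's coordinate system.
        byte[] projected = projectAndClip(fitsBytes.copyBytes(), ctx.getConfiguration());
        if (projected != null) {
            ctx.write(new Text("coadd"), new BytesWritable(projected)); // single key -> single reducer
        }
    }
    private byte[] projectAndClip(byte[] fits, org.apache.hadoop.conf.Configuration conf) {
        return null; // placeholder
    }
}

/** Reducer: stack and mosaic all projected intersections into the final coadd. */
class CoaddReducer extends Reducer<Text, BytesWritable, NullWritable, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> pieces, Context ctx)
            throws IOException, InterruptedException {
        byte[] coadd = null;
        for (BytesWritable piece : pieces) {
            coadd = accumulate(coadd, piece.copyBytes()); // running stack/mosaic
        }
        ctx.write(NullWritable.get(), new BytesWritable(coadd));
    }
    private byte[] accumulate(byte[] coadd, byte[] piece) {
        return piece; // placeholder
    }
}
```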
  • 11. Driver Prefiltering  To assist the process we prefilter the FITS files in the driver.  SDSS camera has 30 CCDs: › 5 colors › 6 abutting strips of sky  Prefilter (path glob) by color and sky coverage (single axis): › Exclude many irrelevant FITS files. › Sky coverage filter is only single axis: • Thus, false positives slip through... ...to be discarded in the mappers. [Figure legend: query bounds, relevant FITS, prefilter-excluded FITS, false positives.] 11
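A hedged sketch of driver-side prefiltering via a path glob: the input paths are restricted to files whose names match the query's color band (and, in practice, a coarse sky coordinate), so irrelevant files are never listed or split. The directory layout and fpC-style filename pattern here are illustrative assumptions, not necessarily the authors' exact layout.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CoaddDriver {
    /**
     * Restrict the job's input to FITS files matching the query's color band,
     * using a path glob so irrelevant files are never listed or split.
     * Sky-coverage filtering (single axis) is assumed to be folded into the
     * directory layout; false positives are discarded later in the mappers.
     */
    public static void addPrefilteredInputs(Job job, String fitsRoot, String band)
            throws Exception {
        // Hypothetical layout: <root>/<run>/fpC-*-<band><camcol>-*.fit
        String glob = fitsRoot + "/*/fpC-*-" + band + "*-*.fit";
        FileInputFormat.addInputPath(job, new Path(glob));
    }
}
```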
  • 12. Performance Analysis  Running time: › 2 query sizes › Run against 1/10th of SDSS (100,058 FITS)  Conclusion: › Considering the small dataset, this is too slow! › Remember 42 minutes for the next slide. [Chart: running time in minutes vs. query sky bounds extent — 1° sq (3885 FITS) and ¼° sq (465 FITS); error bars show 95% confidence intervals, outliers removed via Chauvenet.] 12
  • 13. Performance Analysis  Breakdown of large query running time  main() is sum of: › Driver › runJob()  runJob() is sum of MapReduce parts. [Chart: minutes at each Hadoop stage — Deglob Input Paths, Construct File Splits, Mapper Done, Reducer Done, runJob(), main() — with the Driver, MapReduce, runJob(), and main() spans marked.] 13
  • 14. Performance Analysis  Breakdown of large query running time  Observation: › Running time is dominated by RPCs from client to HDFS to process 1000s of FITS file paths.  Conclusion: › Need to reduce the number of files. [Chart: same Hadoop-stage breakdown as the previous slide, with the total job time and the client-to-HDFS RPC portion highlighted.] 14
  • 15. Sequence Files  Sequence files group many small files into a few large files.  Just what we need!  Real-time images may not be amenable to logical grouping. › Therefore, sequence files are filled in an arbitrary manner: [Diagram: a hash function assigns each FITS filename to one of several sequence files.] 15
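A minimal sketch of the packing step, assuming raw FITS bytes are stored as values keyed by filename and each file is routed to one of a fixed number of sequence files by a hash of its name (the arbitrary assignment described above). The bucket count of 360 mirrors the hashed database used in the talk; paths and layout are assumptions.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class FitsPacker {
    public static void main(String[] args) throws Exception {
        File[] fitsFiles = new File(args[0]).listFiles(); // local FITS directory
        int numBuckets = 360;                             // as in the talk's hashed DB
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer[] writers = new SequenceFile.Writer[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            writers[i] = SequenceFile.createWriter(fs, conf,
                    new Path(args[1], "fits-" + i + ".seq"),
                    Text.class, BytesWritable.class);
        }
        for (File f : fitsFiles) {
            // Route each FITS file to a bucket by a hash of its name.
            int bucket = (f.getName().hashCode() & 0x7fffffff) % numBuckets;
            byte[] bytes = Files.readAllBytes(f.toPath());
            // Key = original filename, value = raw FITS bytes.
            writers[bucket].append(new Text(f.getName()), new BytesWritable(bytes));
        }
        for (SequenceFile.Writer w : writers) w.close();
    }
}
```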
  • 16. Performance Analysis  Comparison: › FITS input vs. unstructured sequence file input*  Conclusion: › 5x speedup!  Hmmm... Can we do better? [Chart: minutes vs. query sky bounds extent — 1° sq (3885 FITS) and ¼° sq (465 FITS) — for FITS and hashed-sequence input; error bars show 95% confidence intervals, outliers removed via Chauvenet.] * 360 seq files in hashed seq DB. 16
  • 17. Performance Analysis  Comparison: › FITS input vs. unstructured sequence file input*  Conclusion: › 5x speedup!  Hmmm... Can we do better? [Same chart as the previous slide, annotated: the FITS runs processed only a subset of the database after prefiltering, while the sequence-file runs processed the entire database but still ran faster.] * 360 seq files in hashed seq DB. 17
  • 18. Structured Sequence Files  Similar to the way we prefiltered FITS files...  SDSS camera has 30 CCDs: › 5 colors › 6 abutting strips of sky › Thus, 30 sequence file types  Prefilter by color and sky coverage (single axis): › Exclude irrelevant sequence files. › Still have false positives. › Catch them in the mappers as before. [Figure legend: query bounds, relevant FITS, prefilter-excluded FITS, false positives.] 18
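With the structured layout, the driver can drop whole sequence files before the job even starts. A sketch under the assumption that sequence files are named by band and strip (the 5 x 6 = 30 types above); the naming scheme is hypothetical.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class StructuredPrefilter {
    /**
     * Add only the sequence files whose (band, strip) could overlap the query.
     * Hypothetical naming: <root>/seq-<band>-<strip>-*.seq, one group per CCD
     * type (5 bands x 6 strips = 30 types). The single-axis sky filter still
     * lets some false positives through; the mappers discard those as before.
     */
    public static void addInputs(Job job, String seqRoot, String band,
                                 int minStrip, int maxStrip) throws Exception {
        job.setInputFormatClass(SequenceFileInputFormat.class);
        for (int strip = minStrip; strip <= maxStrip; strip++) {
            FileInputFormat.addInputPath(job,
                    new Path(seqRoot, "seq-" + band + "-" + strip + "-*.seq"));
        }
    }
}
```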
  • 19. Performance Analysis  Comparison: › FITS vs. unstructured sequence* vs. structured sequence files†  Conclusion: › Another 2x speedup for the large query, 1.5x speedup for the small query. [Chart: minutes vs. query sky bounds extent — 1° sq (3885 FITS) and ¼° sq (465 FITS) — for FITS, hashed-sequence, and grouped/prefiltered-sequence input; error bars show 95% confidence intervals, outliers removed via Chauvenet.] * 360 seq files in hashed seq DB. † 1080 seq files in structured DB. 19
  • 20. Performance Analysis  Breakdown of large query running time  Prediction: › Prefiltering should gain performance in the mapper.  Does it? [Chart: minutes at each Hadoop stage — Deglob Input Paths, Construct File Splits, Mapper Done, Reducer Done, runJob(), main() — for hashed vs. grouped/prefiltered sequence files.] 20
  • 21. Performance Analysis  Breakdown of large query running time  Prediction: › Prefiltering should gain performance in the mapper.  Conclusion: › Yep, just as expected. [Same chart as the previous slide.] 21
  • 22. Performance Analysis  Experiments were performed on a 100,058-FITS database (1/10th of SDSS).  How much of this database is Hadoop churning through? [Chart: minutes vs. query sky bounds extent — 1° sq (3885 FITS) and ¼° sq (465 FITS) — for FITS, hashed-sequence, and grouped/prefiltered-sequence input; error bars show 95% confidence intervals, outliers removed via Chauvenet.] 22
  • 23. Performance Analysis  Comparison: › Number of FITS files considered in mappers vs. number contributing to coadd  Conclusion: › Mappers must discard many FITS files due to nonoverlap of query bounds. [Chart annotations for the large query (3885 contributing FITS): FITS input — 13,415 FITS files, 499 mappers‡; hashed sequence input — 100,058 FITS files, 360 sequence files*, 2077 mappers‡; grouped/prefiltered sequence input — 13,335 FITS files, 144 sequence files†, 341 mappers‡.] * 360 seq files in hashed seq DB. † 1080 seq files in structured DB. ‡ 800 mapper slots on cluster. 23
  • 24. Using SQL to Find Intersections  Store all image colors and sky bounds in a database: › First, query color and intersections via SQL. › Second, send only relevant images to MapReduce.  Consequence: All images processed by mappers contribute to the coadd. No time is wasted considering irrelevant images. [Diagram: (1) the driver retrieves from the SQL database the FITS filenames that overlap the query bounds; (2) only the relevant images are loaded into MapReduce.] 24
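A hedged sketch of step (1): a relational table holding each image's band and sky bounds is queried with a standard rectangle-overlap predicate, and only the returned filenames are handed to the MapReduce driver. The table name, column names, and JDBC URL are illustrative assumptions; RA wrap-around is ignored for brevity.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class IntersectionQuery {
    /** Return the FITS filenames whose bounding boxes overlap the query bounds. */
    public static List<String> findOverlapping(String band, double raMin, double raMax,
                                               double decMin, double decMax) throws Exception {
        List<String> files = new ArrayList<String>();
        Connection conn = DriverManager.getConnection("jdbc:postgresql://host/fitsmeta"); // assumed URL
        // Standard rectangle-overlap test against the stored per-image bounds.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT filename FROM images " +           // assumed table/column names
                "WHERE band = ? AND ra_min < ? AND ra_max > ? " +
                "AND dec_min < ? AND dec_max > ?");
        ps.setString(1, band);
        ps.setDouble(2, raMax);
        ps.setDouble(3, raMin);
        ps.setDouble(4, decMax);
        ps.setDouble(5, decMin);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) files.add(rs.getString(1));
        rs.close();
        ps.close();
        conn.close();
        return files;
    }
}
```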
  • 25. Performance Analysis  Comparison: › nonSQL vs. SQL  Conclusion: › Sigh, no major improvement (SQL is not remarkably superior to nonSQL for the given pairs of bars). [Chart: minutes vs. query sky bounds extent — 1° sq (3885 FITS) and ¼° sq (465 FITS) — for Seq hashed, Seq grouped/prefiltered, SQL → Seq hashed, and SQL → Seq grouped.] 25
  • 26. Performance Analysis  Comparable performance here makes sense: › In essence, prefiltering and SQL performed similar tasks, albeit with 3.5x different mapper inputs (FITS).  Conclusion: › Cost of discarding many images in the nonSQL case was negligible. [Same chart as the previous slide, annotated per bar with the number of mapper input records (FITS) and the number of launched mappers; see slide 29 for the full figures.] 26
  • 27. Performance Analysis  Low improvement for SQL in the hashed case is surprising at first › ...especially considering 26x different mapper inputs (FITS). [Same annotated chart as the previous slide.] 27
  • 28. Performance Analysis  Low improvement for SQL in the hashed case is surprising at first › ...especially considering 26x different mapper inputs (FITS).  Theory: › Scattered distribution of relevant FITS prevented efficient mapper reuse. Consequently, mapper requirements exceeded cluster capacity (~800). We lost mapper parallelism (mappers were queued)! [Same annotated chart as the previous slides.] 28
  • 29. Results  Just to be clear: › Prefiltering improved due to reduction of mapper load. › SQL improved due to data locality and more efficient mapper allocation – the required work was unaffected (3885 FITS). [Chart annotations — mapper input records (FITS) / launched mappers — large query: Seq hashed 100,058 / 2077; Seq grouped 13,335 / 341; SQL → Seq hashed 3,885 / 1714; SQL → Seq grouped 3,885 / 338. Small query: Seq hashed 100,058 / 2149; Seq grouped 6,674 / 144; SQL → Seq hashed 465 / 499; SQL → Seq grouped 465 / 128.] 29
  • 30. Utility of SQL Method  Despite our results (which show SQL to be equivalent to prefiltering)...  ...we predict that SQL should outperform prefiltering on larger databases.  Why? › Prefiltering would contend with an increasing number of false positives in the mappers*. › SQL would incur little additional overhead.  No experiments on this yet. * A space-filling curve for grouping the data might help. 30
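The footnote's space-filling-curve idea could be as simple as ordering images by a Morton (Z-order) key of their sky position before packing, so spatially nearby images tend to land in the same sequence files and the mappers see fewer false positives. A generic sketch, not tied to any particular curve the authors had in mind:

```java
public class MortonOrder {
    /** Interleave the bits of two 16-bit grid coordinates into a Z-order key. */
    public static int mortonCode(int x, int y) {
        int code = 0;
        for (int i = 0; i < 16; i++) {
            code |= ((x >> i) & 1) << (2 * i);
            code |= ((y >> i) & 1) << (2 * i + 1);
        }
        return code;
    }

    /** Map (ra, dec) in degrees onto a 65536 x 65536 grid and return its Z-order key. */
    public static int skyKey(double ra, double dec) {
        int x = (int) (ra / 360.0 * 65535);           // 0 <= ra < 360
        int y = (int) ((dec + 90.0) / 180.0 * 65535); // -90 <= dec <= 90
        return mortonCode(x, y);
    }
}
```

Sorting the FITS list by skyKey(ra, dec) before the packing step would replace the arbitrary hash assignment with a spatially coherent one.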
  • 31. Conclusions  Packing many small files into a few large files is essential.  Structured packing and associated prefiltering offer significant gains (reduced mapper load).  SQL prefiltering of unstructured sequence files yields little improvement (failure to combine scattered HDFS file-splits leads to mapper bloat).  SQL prefiltering of structured sequence files performs comparably to driver prefiltering, but we anticipate superior performance on larger databases.  On a shared cluster (e.g., the cloud), performance variance is high – this doesn’t bode well for online applications and also makes precise performance profiling difficult. 31
  • 32. Future Work  Parallelize the reducer.  Less conservative CombineFileSplit builder.  Conversion to C++, usage of existing C++ libraries.  Query by time-range.  Increase complexity of projection/interpolation: › PSF matching  Increase complexity of stacking algorithm: › Convert straight sum to weighted sum by image quality.  Work toward the larger science goals: › Study the evolution of galaxies. › Look for moving objects (asteroids, comets). › Implement fast parallel machine learning algorithms for detection/classification of anomalies. 32
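For the weighted-sum item in the list above, one common choice (an assumption here, not necessarily the authors' plan) is to weight each image by an image-quality measure such as inverse sky-noise variance; a minimal variant of the earlier unweighted stacker:

```java
/** Weighted per-pixel mean: each image contributes in proportion to its quality weight. */
public class WeightedStacker {
    public static double[][] stack(double[][][] images, double[] weights) {
        int h = images[0].length, w = images[0][0].length;
        double[][] num = new double[h][w];
        double[][] den = new double[h][w];
        for (int i = 0; i < images.length; i++) {
            double wt = weights[i];          // e.g. 1 / (sky noise variance) -- assumed metric
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    double v = images[i][y][x];
                    if (!Double.isNaN(v)) {  // NaN marks out-of-footprint pixels
                        num[y][x] += wt * v;
                        den[y][x] += wt;
                    }
                }
            }
        }
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                num[y][x] = den[y][x] > 0 ? num[y][x] / den[y][x] : Double.NaN;
        return num;
    }
}
```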
  • 33. Questions? kbwiley@astro.washington.edu 33