VO Course 10: Big data challenges in astronomy


How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, as taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master at the University of Granada (UGR).


Transcript

  1. Astronomy’s Big Data Challenges. Juan de Dios Santander Vela (IAA-CSIC)
  2. Overview: What is, exactly, big data? What are the dimensions of big data? What are the big data drivers in astronomy? How can we deal with big data? VO tools for dealing with big data.
  3. What is, exactly, Big Data? “Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” (Wikipedia: “Big Data”)
  4. What is, exactly, Big Data? Big Data is data with at least one Big dimension: bandwidth, number of individual assets, size of individual assets, response speed, …
  5. Word cloud of Big Data concepts: data mining, processing techniques, offline processing, real-time processing, event processing, parallel access, file-level access, access techniques, storage size, data flow, raw data, processed data, extracted information, schemata, capabilities, unstructured and structured data, durability, formats, statistics, value, tagging, tech debt.
  6. Next big data projects in astronomy
  7. Large Synoptic Survey Telescope
  8. The Large Synoptic Survey Telescope Camera. Steven M. Kahn, Stanford/SLAC (for the LSST Consortium)
  9. LSST Data Rates: 2.3 billion pixels read out in less than 2 s, every 12 s; 1 pixel = 2 bytes (raw); over 3 GB/s peak raw data from the camera; real-time processing and transient detection in < 10 s; dynamic range 4 bytes/pixel; > 0.6 GB/s average in the pipeline; 5,000 floating-point operations per pixel; 2 TFlop/s average, 9 TFlop/s peak; ~18 TB/night.
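A minimal back-of-the-envelope sketch in Python of how these figures combine, assuming an 8-hour observing night (the observing time is an assumption, not a number from the slide):

      # Rough check of the LSST data-rate figures quoted above.
      N_PIXELS = 2.3e9          # pixels per readout, as quoted on the slide
      BYTES_PER_PIXEL = 2       # raw bytes per pixel
      READOUT_SECONDS = 2       # read out in less than 2 s
      AVG_PIPELINE_GBS = 0.6    # > 0.6 GB/s average in the pipeline
      NIGHT_HOURS = 8           # assumed observing time per night

      peak_raw_gbs = N_PIXELS * BYTES_PER_PIXEL / READOUT_SECONDS / 1e9
      nightly_tb = AVG_PIPELINE_GBS * NIGHT_HOURS * 3600 / 1e3

      print("peak raw camera rate ~ %.1f GB/s" % peak_raw_gbs)  # ~2.3 GB/s, the same order as the quoted peak
      print("nightly pipeline volume ~ %.0f TB" % nightly_tb)   # ~17 TB, close to the quoted ~18 TB/night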
  10. Relative Survey Power
  11. Square Kilometre Array
  12. Signal Transport & Processing
  13. Signal Transport & Processing: DESIGN COUNTS!
  14. Massive Data Flow, Storage & Processing. Data path: antenna & front-end systems, correlation, data product generation, then temporary storage, on-demand processing, long-term storage and a high-availability storage/DB. Storage? We can’t store it all: one day of the raw stream equals 150 days of global Internet traffic; ~800 PB of temporary storage; ~18 PB/year into long-term storage.
  15. Massive Data Flow, Storage & Processing. Processing needs: > 1 exaflop/s at the correlator (of the order of 10^9 top-range PCs), and ~30 petaflop/s for data product generation.
  16. Massive Data Flow, Storage & Processing. Bandwidth: ~7 PB/s out of the antennas & front ends; > 300 GB/s after correlation; a typical survey would need 5 days of read time at 10 GB/s.
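To put the survey figure in context, a minimal Python sketch deriving the implied survey size from the slide’s 5 days of read time at 10 GB/s, and how long the correlator output would take to write it at 300 GB/s:

      READ_RATE_GBS = 10        # GB/s, read rate quoted for a typical survey
      READ_DAYS = 5             # days of read time quoted
      CORRELATOR_OUT_GBS = 300  # GB/s, > 300 GB/s after correlation

      survey_pb = READ_RATE_GBS * READ_DAYS * 86400 / 1e6
      hours_to_write = survey_pb * 1e6 / CORRELATOR_OUT_GBS / 3600

      print("implied survey size ~ %.1f PB" % survey_pb)                  # ~4.3 PB
      print("written by the correlator in ~ %.1f hours" % hours_to_write) # ~4 hours at 300 GB/s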
  17. Massive Data Flow, Storage & Processing. Bar chart of correlator output bandwidth in TB/s: ALMA and LOFAR (scale 0 to 40 TB/s).
  18. The same bandwidth chart with ASKAP added (scale 0 to 70 TB/s).
  19. The same bandwidth chart, repeated.
  20. Massive Data Flow, Storage & Processing. Bar chart of correlator processing in TFlop/s: VLA and ALMA (scale 0 to 0.002 TFlop/s).
  21. The same processing chart with LOFAR added (scale 0 to 120 TFlop/s).
  22. The same processing chart with ASKAP added (scale 0 to 350 TFlop/s).
  23. The same processing chart, repeated.
  24. Comparison: LHC
  25. CERN/IT/DB. Online system with a multi-level trigger to filter out background and reduce the data volume from 40 TB/s to 100 MB/s: detector output at 40 MHz (40 TB/s); level 1 (special hardware) down to 75 kHz (75 GB/s); level 2 (embedded processors) down to 5 kHz (5 GB/s); level 3 (farm of PCs) down to 100 Hz (100 MB/s); then offline data recording & analysis.
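A short Python sketch of the reduction factors implied by the trigger rates quoted on this slide (the rates come from the slide; the script only divides them):

      levels = [
          ("detector output", 40e6, 40e12),             # 40 MHz, 40 TB/s
          ("level 1 (special hardware)", 75e3, 75e9),   # 75 kHz, 75 GB/s
          ("level 2 (embedded processors)", 5e3, 5e9),  # 5 kHz, 5 GB/s
          ("level 3 (farm of PCs)", 100, 100e6),        # 100 Hz, 100 MB/s
      ]

      for (name, rate_hz, _), (_, next_hz, _) in zip(levels, levels[1:]):
          print("%s -> next level: keep 1 event in %.0f" % (name, rate_hz / next_hz))

      print("overall data-volume reduction: %.0fx (40 TB/s -> 100 MB/s)" % (levels[0][2] / levels[-1][2]))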
  26. CERN/IT/DB. Event Filter & Reconstruction (figures are for one experiment): data from the detector through an event builder switch, input 5-100 GB/s; computer farm capacity 50K SI95 (~4K 1999 PCs); recording rate 100 MB/s (ALICE: 1 GB/s) over a high-speed network to tape and disk servers; raw data 1-1.25 PB/year, summary data 1-500 TB/year; 20,000 Redwood cartridges every year (+ copy).
  27. Dealing with Big Data
  28. Dealing with Big Data. We cannot allow arbitrary queries, but we can offer arbitrary processing instead. We cannot allow full data dumps, but we can generate data on the fly (see above).
  29. Queries as functions: QUERY = FUNCTION { DATA }. Queries need to be precomputed; arbitrary queries are only possible on the precomputed, smaller data sets.
  30. Queries as functions: QUERY = FUNCTION { ALL DATA }. Queries need to be precomputed; arbitrary queries are only possible on the precomputed, smaller data sets.
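A minimal Python sketch of the idea, with an invented toy dataset and an invented mean-flux query: the arbitrary query over all the data is replaced by a small precomputed view that answers the same question.

      # Toy dataset and query, invented for illustration only.
      all_data = [{"source": i, "flux": 0.1 * i} for i in range(100000)]

      # Arbitrary query run directly over all the data: fine here, impossible at petabyte scale.
      def query_mean_flux(data):
          return sum(d["flux"] for d in data) / len(data)

      # Precomputed view: a much smaller summary, built once, offline.
      view = {"n": len(all_data), "flux_sum": sum(d["flux"] for d in all_data)}

      # Arbitrary queries then run only against the small, precomputed view.
      def mean_flux_from_view(v):
          return v["flux_sum"] / v["n"]

      assert abs(query_mean_flux(all_data) - mean_flux_from_view(view)) < 1e-9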
  31. Lambda Architecture. Speed layer: fast, incremental algorithms for queries not yet covered by the batch layer; compensates for batch latency. Serving layer: random access to the views, updated by the batch layer. Batch layer: stores the master dataset and computes arbitrary views.
  32. Batch Layer. Stores the master copy of the dataset (immutable, constantly growing) and precomputes batch views on that master dataset.
  33. Batch Layer. New data is appended to the full dataset, and the batch layer (typically Map/Reduce) recomputes the updated views: View 1, View 2, …, View n.
  34. Serving Layer. Allows batch writes of view updates and random reads on the views; does not allow random writes.
  35. Speed Layer. Allows incremental writes of view updates and short-term temporal queries on the views; can be discarded!
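A minimal Python sketch of how a query combines the layers, using invented per-source counters: the batch view is complete but stale, and the speed view covers only the data that arrived since the last batch recomputation.

      batch_view = {"sourceA": 120, "sourceB": 45}  # precomputed from the master dataset
      speed_view = {"sourceA": 3, "sourceC": 1}     # incremental, newer than the last batch run

      def query(source):
          # The serving layer answers from the batch view; the speed layer
          # compensates for batch latency with the most recent increments.
          return batch_view.get(source, 0) + speed_view.get(source, 0)

      print(query("sourceA"))  # 123
      print(query("sourceC"))  # 1: only seen since the last batch recomputation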
  36. Figure 2.1: The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be corrected.
  37. Computing over Big Data. The batch layer acts as a computational engine on the data. We need to formally specify inputs, processes and outputs (that looks like a workflow, or like SQL querying!).
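One possible way to write such a specification down, sketched in Python with invented task and file names (this is an illustration, not a VO or workflow standard):

      # A computation described only by its inputs, process and outputs.
      step = {
          "inputs":  ["visibilities.ms"],   # hypothetical input dataset
          "process": "grid_and_image",      # hypothetical task name
          "params":  {"cell_arcsec": 1.0, "n_channels": 512},
          "outputs": ["dirty_cube.fits"],   # hypothetical output
      }

      def run(step, registry):
          # Look the named process up in a registry of known tasks and apply it.
          task = registry[step["process"]]
          return task(step["inputs"], step["outputs"], **step["params"])

      registry = {"grid_and_image":
                  lambda ins, outs, **p: "would image %s -> %s with %s" % (ins, outs, p)}
      print(run(step, registry))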
  38. Map/Reduce
  39. Map/Reduce (annotated: PARALLELISABLE!):

      from random import normalvariate
      from math import sqrt

      def res2(x): return pow(mean_v - x, 2.)

      # Random vector, mean 1, stdev 0.001
      v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
      mean_v = reduce(lambda x, y: x + y, v) / len(v)
      res2_v = map(res2, v)
      stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
      print (mean_v, stdev)
  40. Map/Reduce with multiprocessing (annotated: only the map step is parallelised here, but the reduce step is also parallelisable):

      from random import normalvariate
      from math import sqrt
      from multiprocessing import Pool

      def res2(x): return pow(mean_v - x, 2.)

      # Random vector, mean 1, stdev 0.001
      v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
      mean_v = reduce(lambda x, y: x + y, v) / len(v)
      pool = Pool(processes=4)
      res2_v = pool.map(res2, v)
      pool.close()
      stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
      print (mean_v, stdev)
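The annotation on the slide points out that the reduce step is also parallelisable. A minimal Python 3 sketch of one way to do it: reduce each chunk in a worker process, then combine the partial sums.

      from functools import reduce
      from multiprocessing import Pool
      from random import normalvariate

      def chunk_sum(chunk):
          return reduce(lambda x, y: x + y, chunk)

      if __name__ == "__main__":
          v = [normalvariate(1, 0.001) for _ in range(1000000)]
          n_chunks = 4
          chunks = [v[i::n_chunks] for i in range(n_chunks)]  # four interleaved slices

          with Pool(processes=n_chunks) as pool:
              partial_sums = pool.map(chunk_sum, chunks)      # parallel reduce of each chunk

          total = reduce(lambda x, y: x + y, partial_sums)    # final, cheap combine step
          print(total / len(v))                               # ~1.0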
  41. Chart: dependence of execution time on the number of pool processors. X axis: number of pool processors (1 to 8); Y axis: seconds per million elements (0.4 to 0.8); one series each for 1, 5, 10 and 20 million elements.
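A minimal sketch of the kind of timing measurement behind this chart, timing pool.map over the same data for 1 to 8 worker processes (the squaring function is a hypothetical stand-in for the per-element work):

      from multiprocessing import Pool
      from random import normalvariate
      from time import time

      def square(x):
          return x * x

      if __name__ == "__main__":
          n = 1000000
          v = [normalvariate(1, 0.001) for _ in range(n)]
          for procs in range(1, 9):
              with Pool(processes=procs) as pool:
                  t0 = time()
                  pool.map(square, v)
                  dt = time() - t0
              print("%d processes: %.3f s per million elements" % (procs, dt / (n / 1e6)))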
  42. Conclusions. Big data needs different approaches: parallelism and data-side processing, with Map/Reduce as the parallelism engine. We also need ways to formally specify computations.
  43. References & Links: “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray, Microsoft Research; “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google; myExperiment.
