SUNDAY:


HPC databases workshop:

rasdaman:

   • adding arrays to SQL queries
   • array query operators
          • general array constructor
          • subset trim & slice
          • array nest/unnest
          • matrix multiplication
          • histograms
          • formal encoding (e.g. C, C++, Java arrays)
          • nested queries
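A minimal numpy sketch of the operator semantics above (my own illustration, not rasdaman's rasql syntax): a trim keeps the rank, a slice/section drops a dimension, and matrix multiplication and histograms are ordinary array operations.

```python
import numpy as np

a = np.arange(24).reshape(4, 6)   # small 2-D array ("general array constructor")

trim = a[1:3, 2:5]                # trim: subinterval per dimension, rank preserved
sect = a[2, :]                    # slice (section): fix one index, rank drops by one

hist = np.histogram(a, bins=4)[0] # histogram over all cell values

b = np.arange(18).reshape(6, 3)
prod = a @ b                      # matrix multiplication as an array operation

print(trim.shape, sect.shape, prod.shape)  # (2, 3) (6,) (4, 3)
```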
   • storage mapping: variants
          • coordinate-free sequence
          • BLOBs
          • ROLAP
          • imaging multidimensional OLAP
   • tiled array storage
          • regular
          • directional
          • area of interest
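Regular tiling is easy to sketch; the following is my own illustration with arbitrary tile sizes, not rasdaman's storage code. Edge tiles shrink when the array shape is not a multiple of the tile size.

```python
import numpy as np

def regular_tiles(a, th, tw):
    """Split a 2-D array into fixed-size tiles (regular tiling);
    edge tiles may be smaller than th x tw."""
    tiles = {}
    for i in range(0, a.shape[0], th):
        for j in range(0, a.shape[1], tw):
            tiles[(i, j)] = a[i:i + th, j:j + tw]
    return tiles

a = np.arange(36).reshape(6, 6)
tiles = regular_tiles(a, 4, 4)
print(sorted(tiles))          # [(0, 0), (0, 4), (4, 0), (4, 4)]
print(tiles[(0, 4)].shape)    # (4, 2) -- an edge tile
```

Directional and area-of-interest tiling differ only in how the tile boundaries are chosen, not in this basic mechanism.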
   • In-Situ Databases
          • approach: reference external files
          • related: SciQL
   • adding tertiary storage
          • tapes
          • problem: spatial clustering
          • approach: super-tiles = all of the particular index nodes (Reiner 2001)
   • Query processing
          • optimization 1: query rewriting
          • optimization 2: JIT compilation
                  • approach: cluster suitable ops
                  • compile & dynamically bind
                  • benefit: speed up complex, repeated operations
                  • variation: compile code for GPU
   • Intra operator parallelization
          • ...too fast
   • query processing in a federation
          • query splitting
          • work in progress
   • examples
          • human brain imaging
          • gene expression analysis (impressive db queries) -> output JPEG, correlations, ..
          • geo service standardization (OGC, SIC)
   • use cases/ e.g.:
          • satellite imaging
          • 3D clients/visualization
   • history of array DBMSs
          • array as table
• conclusion
          • awesome for science and so on..

Need the slides: lots of enhanced SQL statement examples.


Energy Efficient HPC:

A lot of information in the slides and talk (graphs, etc.).
Extremely interesting. You should read the slides yourself, if you are interested:
http://eehpcwg.lbl.gov/documents


Data-aware networking workshop:

GridFTP (Fatih University, TR):

https://sites.google.com/a/lbl.gov/ndm2012/home/accepted-papers (first one)

    • intro: pipelining, parallelism, concurrency
    • pipelining:
           • useful for large number of small files
           • higher throughputs on small files (1MB)
           • nr. of files affects total throughput but not the optimal pipelining level
           • throughput increases as number of files increases,..
           • BDP = BW * RTT, the optimal window size (pfo)
           • ....
    • parallelism:
           • when the buffer size is too small compared to the BDP
           • advantageous with large files
    • concurrency:
           • advantages over parallelism:
                   • parallelism deteriorates performance with small files (use pipelining)
                   • concurrency + pipelining has better perf. than concurrency + parallelism + pipelining
                   • small RTT: quicker ascent to the peak throughput
                   • ...
    • rules of thumb:
           • always use pipelining
                   • set different levels
           • keep chunks as big as possible
           • use concurrency with pipelining w. small files and small # files
           • add parallelism to concurrency and pipelining with bigger files
           • use parallelism when # files is insufficient to feed BDP
    • recursive chunk size division
           • mean-based algorithm to construct clusters of files with different optimal pipelining levels
           • calculate the optimal pipelining level by dividing the BDP by the mean file size of the chunk
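The BDP rule and the pipelining-level calculation above can be sketched as follows (my own illustration; the example bandwidth, RTT, and file size are made up):

```python
import math

def bdp_bytes(bw_gbps, rtt_ms):
    # bandwidth-delay product: BDP = BW * RTT, in bytes
    return bw_gbps * 1e9 / 8 * (rtt_ms / 1e3)

def pipelining_level(bdp, mean_file_size):
    # rule from the talk: divide the BDP by the mean file size of the chunk
    return max(1, math.ceil(bdp / mean_file_size))

bdp = bdp_bytes(10, 100)                 # 10 Gbps, 100 ms RTT
print(bdp)                               # 125000000.0 (125 MB in flight)
print(pipelining_level(bdp, 1 << 20))    # 120 for 1 MB files
```

The clustering step would group files whose sizes yield the same level, so each chunk gets one pipelining setting.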
    • results
           • awesome (slides needed, graphs and so on,..)

Sandhya Narayan, Hadoop acceleration in an OpenFlow-based cluster:

    • overview of SDN/openflow
          • use case: hadoop
    • hadoop overview
             • hadoop acceleration approaches (usual stuff)
             • overview mapreduce pipeline (ibid)
             • overview of hadoop network traffic (ibid)
    • floodlight as openflow controller
    • openflow switch: openvswitch and link (research link)
    • queues in openflow (for different bandwidths: 50 Mbps, 200 Mbps, ..)
    • improvement in latency due to BW queues
    • conclusion: SDN is awesome, but we don't use much of it now.
    • further work: QoS, dynamic hadoop flows
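For reference, a sketch of what such bandwidth queues look like on Open vSwitch (rates match the talk's 50/200 Mbps example; the port and bridge names and the flow match are placeholders, and this is my reconstruction, not the paper's actual setup):

```shell
# Two HTB queues on a port, capped at 50 Mbps and 200 Mbps
ovs-vsctl set port eth0 qos=@newqos -- \
  --id=@newqos create qos type=linux-htb queues:0=@q0 queues:1=@q1 -- \
  --id=@q0 create queue other-config:max-rate=50000000 -- \
  --id=@q1 create queue other-config:max-rate=200000000

# An OpenFlow controller (e.g. Floodlight) then steers selected Hadoop
# flows into a queue; done by hand it would look like:
ovs-ofctl add-flow br0 "priority=100,ip,nw_dst=10.0.0.2,actions=set_queue:1,normal"
```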

no news there.


Mehmet Balman, Streaming Exa Scale data over 100Gbps Networks:

    • lots-of-small-files problem! file-centric tools are not high speed; latency is still a problem
    • framework for a memory-mapped network channel (MemzNet)
           • blocks
           • memory caches are logically mapped between client and server
           • advantages:
                  • decoupling i/o and network ops (front/backend)
                  • not limited by file size characteristics
                  • moving climate files efficiently (gridftp, fopen,..)
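The decoupling idea can be sketched as a bounded block cache between an I/O front-end and a network back-end (my own toy illustration, not MemzNet's implementation):

```python
import os
import queue
import tempfile
import threading

BLOCK_SIZE = 4 * 1024 * 1024           # 4 MB blocks, as in the SC11 demo

def reader(path, blocks):
    """I/O front-end: chops a file into fixed-size blocks, so the
    transfer is not limited by file-size characteristics."""
    with open(path, "rb") as f:
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            blocks.put(data)
    blocks.put(None)                    # end-of-stream marker

def sender(blocks, send):
    """Network back-end: drains the block cache, decoupled from file layout."""
    while True:
        data = blocks.get()
        if data is None:
            break
        send(data)

blocks = queue.Queue(maxsize=256)       # bounded cache between the two halves
out = []
t = threading.Thread(target=sender, args=(blocks, out.append))
t.start()

# demo: a 4 MB + 1 byte file yields exactly two blocks
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * (BLOCK_SIZE + 1))
tmp.close()
reader(tmp.name, blocks)
t.join()
os.unlink(tmp.name)
print(len(out))                         # 2
```

In the real system the cache lives in shared memory mapped between client and server; a thread queue just makes the front/back-end split visible.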
    • SC11 100Gbps demo
           • CMIP3 data (35 TB) over GPFS at NERSC
           • block size 4 MB
           • each block's data section was aligned to the system page size
           • 1 GB cache
           • testbed overview:
                  • many tcp streams
                  • effect: very high CPU usage
    • MemzNet's performance (buffer size 5 MB)

No new information at all.



MONDAY:


parallel storage workshop:

keynote (eric barton)

    • http://www.pdsw.org/keynote.shtml
    • http://www.pdsw.org/pdsw12/slides/keynote-FF-IO-Storage.pdf

poster sessions
  slides and papers available online: http://www.pdsw.org/index.shtml

slides (papers if no slides available at the time):
   1. http://www.pdsw.org/pdsw12/papers/he-pdsw12.pdf
   2. http://www.pdsw.org/pdsw12/slides/crume-slides-pdsw12.pdf
   3. http://www.pdsw.org/pdsw12/papers/grawinkle-pdsw12.pdf - no slides yet
   4. http://www.pdsw.org/pdsw12/papers/kim-pdsw12.pdf - no slides yet
   5. http://www.pdsw.org/pdsw12/slides/jwchoi_sc_SAN.pdf
   6. http://www.pdsw.org/pdsw12/slides/ren-tablefs_giga_pdsw.pdf
   7. http://www.pdsw.org/pdsw12/papers/goodell-pdsw12.pdf - no slides yet
   8. http://www.pdsw.org/pdsw12/slides/watkins-datamods-pdsw12.pdf
   9. http://www.pdsw.org/pdsw12/papers/carns-pdsw12.pdf - no slides (yet?)


HFT workshop:

http://www.cs.usfca.edu/~mfdixon/whpcf12/whpcf_12_program.html

2nd keynote - nvidia (john ashley) - how not to be roadkill

    • overview
    • background: EE, realtime data, big data, datamining, geospatial,..
    • drivers - power and heat
    • drivers - financial regulators
    • drivers - the world as we don't know it:
           • no arch. for everything, multi-arch
           • hadoop isn't the answer to everything
           • need to optimize cost and risk
           • need tools and techniques to implement across heterogeneous solutions
           • need metrics to identify tradeoffs
                   • example:
                           • hanweck - reduced capital expenditure 10x, operating expenditure 13x
                           • citadel - each GPU saves 180.6K USD / year
                           • JPMC - 80 percent operating expenditure savings through GPUs
    • drivers - information advantage
           • is knowledge power?
                   • profit = f(knowledge, capital, capability)
                   • low latency/hft teams know this,..
           • knowing what your competition does
           • are you in the red with respect to capability to price and risk deals,..
                   • analytically? better models? faster?
                   • computationally? new technology -> time to market
           • JPMorgan runs GPUs for risk analysis
    • crossing the road w/o getting hit
           • technology
                   • no longer hw agnostic
                   • heterogeneous
                   • suitable
                   • data is the new bottleneck
    • skills
           • parallel thinking
                   • data awareness
                   • multi-paradigm, multi-programming
                   • experimentalism
                   • hft guys are into all of this and so on,...
           • parallel thinking
                   • chunking work
                           • distribution
                           • tiling
                   • cyclic reduction, parallel solvers, swarm optimization, monte carlo
                   • numerical issues
                   • awareness of discrete math issues, SP/DP
                   • numerical stability, async. algos, red/black coloring, multi-level grid solvers
    • data awareness
           • not just hadoop
           • efficient organization and delivery of data to compute is key
           • dataflow programming is key
           • hpc programmers already know this
           • examples:
                  • structure of arrays vs array of structures, esp. as vector units get wider
                  • tiled algorithms vs naive algorithms drastically improve performance
           • some firms still believe that language-optimized and hardware-aware programming is wrong
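The SoA-vs-AoS point can be seen directly in numpy: the same field is contiguous in a structure-of-arrays layout but strided in an array-of-structures layout (illustration only; field names are made up):

```python
import numpy as np

n = 100_000
# array of structures: fields interleaved per record
aos = np.zeros(n, dtype=[("price", "f8"), ("qty", "f8"), ("ts", "f8")])
# structure of arrays: each field is its own contiguous array
soa_price = np.zeros(n)

aos["price"][:] = 1.0
soa_price[:] = 1.0

# the SoA field walks memory 8 bytes at a time; the AoS field view
# must skip the whole 24-byte record, which defeats wide vector units
print(soa_price.strides)        # (8,)
print(aos["price"].strides)     # (24,)
```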
    • experimentalism
           • innovate
           • avoid analysis paralysis
           • define relevant metrics, collect them, and then act
    • STAC-A2: a benchmark focused on metrics and the business problem
           • can be used to compare a range of innovative potential solutions
           • gives free rein to parallel and data-sensitive computing
    • case study
           • CARMA: standalone ARM + GPU micro-server dev kit; GPU attached over a narrow PCIe link
                  • monte carlo based
                  • MPI
                  • CARMA rocks for HFT
                  • speed
                  • low power consumption

SC12 workshop writeup
