You’re not going to see this stuff because this is not a serious MongoDB talk.

Also, Math

Open source Python packages that can analyze & visualize data from MongoDB

specialized MongoDB driver that can parse almost a million documents per second

But this isn’t a serious talk because there won’t be any cylinders.

If you came for cylinders, I don’t want you to be disappointed.

We downloaded 2.5 billion weather measurements from the US Government.

The stations do have cylinders; does that mean they’re databases?

Stations have various frequencies: once per day, twice, hourly, every 5 minutes, ….

André showed how you can choose the price-performance tradeoff that’s right for you:

Single-server.

Massively sharded cluster.

I went with the single-server option.

Global air temperature each hour in December last year.

The remainder of this talk is going to discuss:

open source Python packages

algorithms

performance issues.

The code to do all this is quite simple.

The work’s all been done for me.

PyMongo represents BSON documents as Python dicts.

We take the values from each dict and put them in a Python list.

Then we convert the Python list to a NumPy array.
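
The dict-to-list step can be sketched without a database. This is a minimal illustration, assuming documents shaped like the GeoJSON sample in the slides; `docs_to_rows` and `sample_docs` are hypothetical names, and the second document is made up.

```python
def docs_to_rows(docs):
    """Flatten weather documents into (lon, lat, temperature) tuples."""
    rows = []
    for doc in docs:
        # GeoJSON points store coordinates as [longitude, latitude].
        lon, lat = doc['position']['coordinates']
        rows.append((lon, lat, doc['airTemperature']['value']))
    return rows


# Hypothetical stand-ins for the dicts a PyMongo cursor would yield.
sample_docs = [
    {'position': {'type': 'Point', 'coordinates': [-94.6, 39.117]},
     'airTemperature': {'value': 45, 'quality': '1'}},
    {'position': {'type': 'Point', 'coordinates': [-73.9, 40.7]},
     'airTemperature': {'value': 38, 'quality': '1'}},
]

rows = docs_to_rows(sample_docs)
# Transpose the rows into one tuple per column.
lons, lats, temps = zip(*rows)
```

Passing `rows` to `numpy.array` would give an N×3 array whose columns are longitude, latitude, and temperature.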

How do we make the contour plot? We have to interpolate among these points to come up with a temperature map of the whole globe.

But first notice all the white areas.

The arrangement is very uneven.

Triangulation comes up with a set of non-overlapping triangles covering all the points.

This is called barycentric interpolation; use that at the cocktail party later on.
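
For the curious, here is a minimal pure-Python sketch of barycentric interpolation inside a single triangle. The triangle coordinates are made up for illustration; only the three vertex temperatures (53, 48, 54) come from the slide, and since the slide's geometry is unknown this does not try to reproduce its 51.1 result.

```python
def barycentric_weights(p, a, b, c):
    """Return the weights of point p relative to triangle vertices a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError('triangle is degenerate')
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    # The three weights always sum to one.
    return wa, wb, 1.0 - wa - wb


def interpolate(p, triangle, values):
    """Weighted average of the vertex values, using barycentric weights."""
    wa, wb, wc = barycentric_weights(p, *triangle)
    va, vb, vc = values
    return wa * va + wb * vb + wc * vc


# Hypothetical triangle; the vertex temperatures are the ones on the slide.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
vertex_temps = (53.0, 48.0, 54.0)

at_vertex = interpolate((0.0, 0.0), triangle, vertex_temps)
at_centroid = interpolate((4.0 / 3.0, 4.0 / 3.0), triangle, vertex_temps)
```

At a vertex the result is that vertex's temperature, and at the centroid it is the plain mean of the three; any other point inside the triangle lands somewhere in between, weighted toward the nearest vertices.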

Brings us from this…

So we can discard our original samples now and just use the grid.

But notice we can only contour the spaces between stations. There’s no way to know about the edges.
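
To make the edge problem concrete, here is a minimal sketch assuming SciPy's `scipy.interpolate.griddata`, with four made-up stations (the coordinates and temperatures are illustrative, not from the dataset). Grid cells outside the area enclosed by the stations come back as NaN, so there is nothing to contour there.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical stations: three on the outside, one in the middle.
lons = np.array([-100.0, -80.0, -90.0, -90.0])
lats = np.array([30.0, 30.0, 45.0, 36.0])
temps = np.array([53.0, 48.0, 54.0, 51.0])

# One-degree grid over the whole globe, as in the slides.
xs = np.linspace(-180, 180, 361)
ys = np.linspace(-90, 90, 181)
grid_x, grid_y = np.meshgrid(xs, ys)

# Linear interpolation triangulates the stations, then interpolates
# within each triangle. Points outside every triangle become NaN.
zs = griddata((lons, lats), temps, (grid_x, grid_y), method='linear')

inside = zs[126, 90]   # grid cell at lon -90, lat 36 (the middle station)
outside = zs[0, 0]     # grid cell at lon -180, lat -90 (far from any station)
```

The resulting `zs` has one row per latitude and one column per longitude, which is the layout `pyplot.contour(xs, ys, zs)` expects.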

So that gets us from this map of just the stations.

So I came up with a hack to fill in the rest of the space.

It doesn’t know that if you keep heading North from the United States you end up in Russia.

TODO: the movie again!

So my program just re-executes the process once for each hour’s worth of data, for the whole month of December.

But it’s a little slow: it takes almost a second to generate each frame, which means that creating a minute-long movie might take 10 minutes of rendering time.
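
As a back-of-the-envelope check, here is the rendering cost for the full month, assuming one frame per hour of data and treating "almost a second" as a one-second upper bound per frame.

```python
HOURS_PER_DAY = 24
DAYS_IN_DECEMBER = 31
SECONDS_PER_FRAME = 1.0  # assumption: upper bound for "almost a second"

# One frame per hour's worth of data for the whole month.
frames = HOURS_PER_DAY * DAYS_IN_DECEMBER

# Total rendering time in minutes at the assumed per-frame cost.
render_minutes = frames * SECONDS_PER_FRAME / 60
```

That works out to 744 frames and a little over 12 minutes of rendering, the same order of magnitude as the 10-minute estimate.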

It’s fast, but can we go faster?

Directly from MongoDB to NumPy, all written in C.

No intermediate Python dictionaries.

Written by David Beach, a financial analyst. Just an open source project by a MongoDB community member.

You get NumPy arrays back directly, with no further processing.

PyMongo takes 0.0593 sec

Monary takes 0.0079 sec

Monary is roughly 7.5x faster
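
The two timings and the documents-per-second figures quoted in the slides can be cross-checked with a little arithmetic. The variable names are illustrative; the numbers are the ones from the talk.

```python
pymongo_seconds = 0.0593
monary_seconds = 0.0079

# Ratio of the two query times.
speedup = pymongo_seconds / monary_seconds

# Throughput figures from the slides, in documents per second.
pymongo_rate = 109_000
monary_rate = 817_000
rate_ratio = monary_rate / pymongo_rate

# Time multiplied by rate gives the implied size of the query result.
docs_via_pymongo = pymongo_seconds * pymongo_rate
docs_via_monary = monary_seconds * monary_rate
```

Both ratios come out at about 7.5, and both products imply a query returning roughly 6,500 documents, so the figures are consistent with each other.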

Me, and Jason Carey: MongoDB driver engineers, overseers

Kyle Suarez and Matt Cotter: Interns, contributing this summer

Rutgers, Carleton College

We can do sophisticated processing and visualization of that data using SciPy and Matplotlib.

- 1. The Weather of the Century: Visualization A. Jesse Jiryu Davis Senior Python Engineer, MongoDB @jessejiryudavis
- 2. Serious MongoDB Talk
- 3. Serious MongoDB Talk Database
- 4. Serious MongoDB Talk
- 5. This Talk
- 6. Where’s the data from?
- 7. Where’s the data from?
- 8. How Much Is There?
- 9. Visualization
- 10. Visualization Pipeline: MongoDB → PyMongo → Python dicts → NumPy → SciPy → Matplotlib
- 11. { ts: ISODate("1991-01-01T00:00:00Z"), position: { type: "Point", coordinates: [ -94.6, 39.117 ] }, airTemperature: { value: 45, quality: "1" } } GeoJSON
- 12. import numpy
      import pymongo
      data = []
      db = pymongo.MongoClient().my_database
      for doc in db.collection.find(query):
          data.append((
              doc['position']['coordinates'][0],
              doc['position']['coordinates'][1],
              doc['airTemperature']['value']))
      arrays = numpy.array(data)
- 13. # NumPy column access syntax.
      lons = arrays[:, 0]
      lats = arrays[:, 1]
      temps = arrays[:, 2]
- 14. import numpy
      from scipy.interpolate import griddata
      from matplotlib import pyplot
      xs = numpy.linspace(-180, 180, 361)
      ys = numpy.linspace(-90, 90, 181)
      # Magic!!
      zs = griddata((lons, lats), temps, tuple(numpy.meshgrid(xs, ys)), method='linear')
      # Also magic!!
      pyplot.contour(xs, ys, zs)
- 15. import numpy
      from scipy.interpolate import griddata
      from matplotlib import pyplot
      xs = numpy.linspace(-180, 180, 361)
      ys = numpy.linspace(-90, 90, 181)
      zs = griddata((lons, lats), temps, tuple(numpy.meshgrid(xs, ys)), method='linear')
      pyplot.contour(xs, ys, zs)
- 16. Triangulation
- 17. Triangulation
- 18. Triangulation What temperature?
- 19. Barycentric Interpolation: What temperature? Vertex temperatures 53, 48, 54. Weighted average: 51.1
- 20. Interpolation 51.1
- 21. Interpolation
- 22. Interpolation
- 23. Contours
- 24. Contours
- 25. # Not terrifically fast
      import numpy
      import pymongo
      data = []
      db = pymongo.MongoClient().my_database
      for doc in db.collection.find(query):
          data.append((
              doc['position']['coordinates'][0],
              doc['position']['coordinates'][1],
              doc['airTemperature']['value']))
      arrays = numpy.array(data)
- 26. Analyzing large datasets • Querying: 109k documents per second • (On localhost) • Can we go faster? • Enter “Monary”
- 27. Monary by David Beach. With PyMongo: MongoDB → PyMongo → Python dicts → NumPy → Matplotlib. With Monary: MongoDB → Monary → NumPy → Matplotlib
- 28. import monary
      monary_connection = monary.Monary()
      arrays = monary_connection.query(
          db='my_database',
          coll='collection',
          query=query,
          fields=[
              'position.coordinates.0',
              'position.coordinates.1',
              'airTemperature.value'],
          types=['float32', 'float32', 'float32'])
- 29. Monary • PyMongo: 109k documents per second • Monary: 817k documents per second
- 30. Visualization
- 31. Monary • Author: David Beach • Interns: Kyle Suarez, Matt Cotter • Mentors: A. Jesse Jiryu Davis, Jason Carey
- 32. Monary Recent features: • Easy installation • Nested field access • Aggregation • Python 3
- 33. Monary Future: • Insert, update, remove • SSL and authentication mechanisms • parallelCollectionScan
- 34. Thanks • Monary • NumPy • SciPy • Matplotlib
- 35. Thanks
- 36. Thank you A. Jesse Jiryu Davis Senior Python Engineer, MongoDB #MongoDBWorld
