Juliana Freire PPT

  1. Exploring Big and not so Big Data: Opportunities and Challenges
     Juliana Freire
     juliana.freire@nyu.edu
     Visualization and Data Analysis (ViDA) Center
     http://bigdata.poly.edu
     NYU Poly
  2. Big Data: What is the Big deal?
     http://www.google.com/trends/explore#q=%22big%20data%22
     ViDA Center Juliana Freire
  3. Big Data: What is the Big deal?
     • Many success stories
       – Google: many billions of pages indexed, products, structured data
       – Facebook: 1.1 billion users using the site each month
       – Twitter: 517 million accounts, 250 million tweets/day
     • This is changing society!
  4. Big Data: What is the Big deal?
     • Smart Cities: 50% of the world population lives in cities
       – Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, …
       – Make cities more efficient and sustainable, and improve the lives of their citizens
         http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html
     • Enable scientific discoveries: science is now data rich
       – Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider
       – Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000 results in Google Scholar!)
     • Data is currency
  5. Big Data: What is the Big deal?
     • Smart Cities
       – Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, …
       – Make cities more efficient and sustainable, and improve the lives of their citizens
     • Enable scientific discoveries: science is now data rich
       – Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider
       – Social data, e.g., Facebook, Twitter
     • Data is currency
  6. Big Data: What is the Big deal?
     • Big data is not new: financial transactions, call detail records, astronomy, …
     • What is new is that there are many more data enthusiasts
     • More data are widely available, e.g., Web, data.gov, scientific data
     • Computing is cheap and easy to access
       – Server with 64 cores, 512GB RAM ~$11k
       – Cluster with 1000 cores ~$150k
       – Pay as you go: Amazon EC2
     [Plot from Howe and Halperin, DEB 2012. Labels: data volumes, % IT investment, rank; 2010, 2020; Astronomy, Physics, Medicine, Geosciences, Microbiology, Chemistry, Social Sciences]
  7. Big Data: What is the Big deal?
     • Big data is not new: financial transactions, call detail records, astronomy, …
     • What is new is that there are many more data enthusiasts
     • More data are widely available, e.g., Web, data.gov, scientific data, social and urban data
     • Computing is cheap and easy to access
       – Server with 64 cores, 512GB RAM ~$11k
       – Cluster with 1000 cores ~$150k
       – Pay as you go: Amazon EC2
  8. Big Data: What is hard?
     • Scalability is not the problem…
     • Usability is the Big issue
     [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
  9. Exploring data is hard
     [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
  10. Exploring data is hard, regardless of whether the data is big or small
      [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
  11. Case Study: Studying Cab Trips in NYC
      Prepare data for analysis
      • Raw data for 2011: 63 GB
        – 24 csv files, 2 csv files for each month: one for trip data and another for fare data
        – ~170M trips
      • Cleaning
        – ~60,000 fare records do not have trip records
        – ~200 duplicates per month
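The two cleaning issues on this slide (fare records with no matching trip, and duplicate records) can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the talk: the column names `taxi_id` and `pickup_time`, the choice of those two columns as the matching key, and the sample rows are all assumptions.

```python
import csv
import io


def clean(trip_rows, fare_rows, key_fields=("taxi_id", "pickup_time")):
    """Drop duplicate trips and set aside fares that match no trip.

    key_fields is a hypothetical matching key; the real files may
    identify a trip differently.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    seen = set()
    trips = []
    duplicates = 0
    for row in trip_rows:
        k = key(row)
        if k in seen:
            duplicates += 1      # same key already kept: count and skip
            continue
        seen.add(k)
        trips.append(row)

    fares, orphans = [], []
    for row in fare_rows:
        # A fare is an orphan when no kept trip shares its key.
        (fares if key(row) in seen else orphans).append(row)
    return trips, fares, orphans, duplicates


# Made-up sample data, only to exercise the function.
TRIPS_CSV = """taxi_id,pickup_time,distance
A1,2011-05-01 08:00:00,1.2
A1,2011-05-01 08:00:00,1.2
B2,2011-05-01 08:05:00,3.4
"""
FARES_CSV = """taxi_id,pickup_time,fare
A1,2011-05-01 08:00:00,6.5
C3,2011-05-01 09:00:00,12.0
"""

trips, fares, orphans, duplicates = clean(
    csv.DictReader(io.StringIO(TRIPS_CSV)),
    csv.DictReader(io.StringIO(FARES_CSV)),
)
print(len(trips), len(fares), len(orphans), duplicates)
```

The sketch keeps all trip keys in memory, which is a simplification; with ~170M trips a month-by-month pass or an external sort may be needed.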
  12. Storage Solutions: Temporal Queries
      • SQLite
        – 20 GB of storage (index on pickup_time)
        – Ordered queries: 9.39s
        – Reverse ordered queries: 9.41s
        – Shuffled queries: 9.37s
      • Custom storage
        – 12 GB of storage (in-memory binary search instead of index)
        – Ordered queries: 0.6s
        – Reverse ordered queries: 1.4s
        – Shuffled queries: 1.2s
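The custom storage on this slide replaces a database index with a binary search over trips held in memory. A minimal sketch of that idea follows, assuming trips are kept sorted by pickup time. The `TripStore` class, its method names, and the integer timestamps are hypothetical and stand in for whatever binary layout the authors actually used.

```python
from bisect import bisect_left


class TripStore:
    """Hypothetical in-memory store answering time-range queries."""

    def __init__(self, trips):
        # trips: iterable of (pickup_time, payload); sorted once by time.
        self._trips = sorted(trips, key=lambda t: t[0])
        self._times = [t[0] for t in self._trips]

    def query(self, start, end):
        """Return trips with start <= pickup_time < end."""
        lo = bisect_left(self._times, start)  # first trip at or after start
        hi = bisect_left(self._times, end)    # first trip at or after end
        return self._trips[lo:hi]


# Timestamps are plain integers here (e.g. seconds since some epoch).
store = TripStore([(30, "c"), (10, "a"), (20, "b"), (20, "b2")])
in_range = store.query(10, 21)
empty = store.query(31, 40)
print(in_range, empty)
```

Each query costs two binary searches plus a slice, so the order in which queries arrive does not matter much in this sketch.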
  13. Storage Solutions: Spatial-Temporal
      • All trips for a week in a given region
      • All trips in a week for a given taxi
      • All trips in a week for a given taxi in a given region
      Needs a complex indexing scheme that combines spatial, temporal, and taxi id searches
  14. Storage Solutions: Spatial-Temporal
      • SQLite
        – 20+10 GB of storage (index on time and id, r-tree for coordinates)
        – Creating indexes: 52hrs
        – Range queries: 2.1s
        – Combined queries: 15.3s
        – Cross-table queries: 57s
      • Custom storage (ours)
        – 12+4 GB of storage (using (4d) kd-tree on time, id and coordinates)
        – Building kd-tree: 8 mins
        – Range queries: 0.2s
        – Combined queries: 0.2s
        – Cross-table queries: 2s
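A single kd-tree over time, taxi id, and coordinates lets one range query combine all three kinds of constraint. The sketch below is a textbook kd-tree, not the authors' implementation. It assumes each trip is a 4-tuple `(time, taxi_id, x, y)` with numeric values, and the names `Node`, `build`, and `range_query` are hypothetical.

```python
import math
from typing import NamedTuple, Optional


class Node(NamedTuple):
    point: tuple
    axis: int
    left: Optional["Node"]
    right: Optional["Node"]


def build(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def range_query(node, lo, hi, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    p, a = node.point, node.axis
    if all(l <= v <= h for v, l, h in zip(p, lo, hi)):
        out.append(p)
    if lo[a] <= p[a]:            # left subtree holds values <= p[a]
        range_query(node.left, lo, hi, out)
    if p[a] <= hi[a]:            # right subtree holds values >= p[a]
        range_query(node.right, lo, hi, out)
    return out


# Made-up trips: (time, taxi_id, x, y)
points = [
    (100, 7, 1.0, 1.0),
    (200, 7, 5.0, 5.0),
    (150, 3, 1.5, 1.2),
    (900, 7, 1.1, 0.9),
    (120, 7, 1.2, 1.4),
]
tree = build(points)

# Trips in a time window, for taxi 7, inside a bounding box.
taxi_hits = sorted(range_query(tree, (0, 7, 0.0, 0.0), (500, 7, 2.0, 2.0)))

# Same window and box, any taxi: leave the id axis unbounded.
region_hits = sorted(range_query(tree, (0, -math.inf, 0.0, 0.0),
                                 (500, math.inf, 2.0, 2.0)))
print(taxi_hits, region_hits)
```

Leaving an axis unbounded turns the same call into a region-only or taxi-only query, which is one reason a single multidimensional index can serve all three query types on the previous slide.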
  15. Summary Statistics
      Analysis/Modeling
      • 13,237 Medallion Cabs
      • 42,000 Taxi Drivers
      • Average Number of Rides: 485k/day
      • Average Number of Passengers: 660k/day
      [Plot: Rides in 2011, Jan to Dec. Labels: 590k, 29k, Apr 2, Apr 3, Aug 28, Irene, Dec 25]
  16. Weekly Patterns
      Analysis/Modeling
      [Plot: Rides per Hour, June 2011, with ticks at 0h. Labels: Between 5k and 35k rides/hour; Night Life! Rides at Midnight]
  17. TLCVis
  18. Drop-offs vs. Pickups
      [Map panels: Drop-off, Pickup]
      Most of the drop-offs occur on the avenues, while most of the pickups occur on the streets
  19. Studying Anomalies
      Sunday, May 1st 2011
      [Panels: 4:00AM-4:30AM, 6:00AM-6:30AM, 8:00AM-8:30AM]
  20. Studying Anomalies
      Sunday, May 1st 2011
      [Panels: 4:00AM-4:30AM, 6:00AM-6:30AM, 8:00AM-8:30AM]
  21. Studying Anomalies
      Sunday, May 1st 2011
      [Panels: 8:00AM-8:30AM, 9:30AM-10:00AM]
  22. Studying Anomalies
      Interpretation
      Sunday, May 1st 2011
      [Panels: 8:00AM-8:30AM, 9:30AM-10:00AM]
      Five Borough Bike Tour
  23. Studying Anomalies
      Sunday, May 1st 2011
      [Panel: 07:00AM-08:00AM]
  24. Studying Anomalies
      Sunday, May 1st 2011
      [Panel: 08:00AM-10:00AM]
  25. Studying Anomalies
      Sunday, May 1st 2011
      [Panel: 10:00AM-11:00AM]
  26. Studying Patterns
      May 1st – May 7th 2011, 3.6 Million Trips
      Compare movement in the airports against the large train stations
  27. Studying Patterns
      May 1st – May 7th 2011, 3.6 Million Trips
      [Panels: Train Stations, Airports]
  28. Studying Patterns
      May 1st – May 7th 2011, 3.6 Million Trips
      [Panels: Train Stations, Airports]
  29. Data exploration reveals bad data…
  30. Uses of Clean Data: FindMeACab App
  31. Take Away
      • Data exploration is challenging for both small and big data
      • It is hard to prepare data for exploration
      • For many tasks, existing tools are too cumbersome or do not scale
      • Need better, usable tools
        – Tools for data enthusiasts who are not computer scientists!
      • Visualization is essential for exploring large volumes of data: “A picture is worth a thousand words”
      • Pictures help us think [Tamara Munzner]
        – Substitute perception for cognition
        – Free up limited cognitive/memory resources for higher-level problems
  32. Masters in Big Data
      • New degree at NYU Poly – Spring 2014
      • Courses:
        – Machine learning
        – Massive data analysis
        – Visualization
        – Visual Analytics
        – Database Systems
        – Algorithms
        – …
  33. Thanks
