Juliana Freire PPT
 

    Presentation Transcript

    • Exploring Big and not so Big Data: Opportunities and Challenges
      Juliana Freire, juliana.freire@nyu.edu
      Visualization and Data Analysis (ViDA) Center, http://bigdata.poly.edu, NYU Poly
    • Big Data: What is the Big deal?
      http://www.google.com/trends/explore#q=%22big%20data%22
      ViDA Center Juliana Freire 2
    • Big Data: What is the Big deal?   Many success stories –  Google: many billions of pages indexed, products, structured data –  Facebook: 1.1 billion users using the site each month –  Twitter: 517 million accounts, 250 million tweets/day   This is changing society!ViDA Center Juliana Freire 3
    • Big Data: What is the Big deal?  Smart Cities: 50% of the world population lives in cities –  Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, … –  Make cities more efficient and sustainable, and improve the lives of their citizens http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html  Enable scientific discoveries: science is now data rich –  Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider –  Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000 results in Google Scholar!)  Data is currencyViDA Center Juliana Freire 4
    • Big Data: What is the Big deal?  Smart Cities –  Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, … –  Make cities more efficient and sustainable, and improve the lives of their citizens  Enable scientific discoveries: science is now data rich –  Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider –  Social data, e.g., Facebook, Twitter  Data is currencyViDA Center Juliana Freire 5
    • Big Data: What is the Big deal?   Big data is not new: financial transactions, call detail records, astronomy, …   What is new is that there are many more data enthusiasts   More data are widely available, e.g.,and Halperin, DEB 2012 Plot from Howe Web, data.gov, data volumes, % IT investment Astronomy scientific data   Computing is cheap and easy to access Physics –  Server with 64 cores, 512GB RAM ~$11k –  ClusterMedicine1000 cores ~$150k with –  Pay as you go: Amazon EC2 Geosciences 2020 Microbiology Chemistry Social Sciences 2010 rankViDA Center Juliana Freire 6
    • Big Data: What is the Big deal?   Big data is not new: financial transactions, call detail records, astronomy, …   What is new is that there are many more data enthusiasts   More data are widely available, e.g., Web, data.gov, scientific data, social and urban data   Computing is cheap and easy to access –  Server with 64 cores, 512GB RAM ~$11k –  Cluster with 1000 cores ~$150k –  Pay as you go: Amazon EC2ViDA Center Juliana Freire 7
    • Big Data: What is hard?
      • Scalability is not the problem…
      • Usability is the Big issue
      [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
    • Exploring data is hard
      [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
    • Exploring data is hard, regardless of whether the data is big or small
      [Diagram labels: algorithms, data, visual encodings, technology, user interfaces, statistics, provenance, interaction modes, math, machine learning, data management, data, knowledge]
    • Case Study: Studying Cab Trips in NYC
      Prepare data for analysis
      • Raw data for 2011: 63 GB
        – 24 csv files, 2 csv files for each month: one for trip data and another for fare data
        – ~170M trips
      • Cleaning
        – ~60,000 fare records do not have trip records
        – ~200 duplicates per month
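The two cleaning steps on this slide (fare records with no matching trip, and duplicate records) can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the data was cleaned, and the column names `medallion` and `pickup_datetime` used as the join key, as well as the sample rows, are assumptions made up for the example.

```python
import csv
import io

def clean(trip_rows, fare_rows, key_fields=("medallion", "pickup_datetime")):
    """Drop duplicate trips and separate fare records that match no trip.

    key_fields is a hypothetical join key; the real schema is not given.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    seen = set()
    trips = []
    for row in trip_rows:
        k = key(row)
        if k in seen:
            continue  # duplicate trip record, keep only the first copy
        seen.add(k)
        trips.append(row)

    fares, orphans = [], []
    for row in fare_rows:
        # A fare is an orphan when no trip shares its key.
        (fares if key(row) in seen else orphans).append(row)
    return trips, fares, orphans

# Tiny made-up sample: one duplicated trip, one fare without a trip.
TRIPS_CSV = """medallion,pickup_datetime,passengers
A1,2011-01-01 00:05:00,1
A1,2011-01-01 00:05:00,1
B2,2011-01-01 00:07:00,2
"""
FARES_CSV = """medallion,pickup_datetime,fare
A1,2011-01-01 00:05:00,7.5
C3,2011-01-01 00:09:00,12.0
"""

trips, fares, orphans = clean(
    csv.DictReader(io.StringIO(TRIPS_CSV)),
    csv.DictReader(io.StringIO(FARES_CSV)),
)
print(len(trips), len(fares), len(orphans))
```

At the scale quoted on the slide (~170M trips) the set of keys would likely not fit comfortably in memory on a laptop, so a real pipeline would probably sort or partition the files by month first; the sketch only shows the logic.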
    • Storage Solutions: Temporal Queries
      • SQLite
        – 20 GB of storage (index on pickup_time)
        – Ordered queries: 9.39s
        – Reverse ordered queries: 9.41s
        – Shuffled queries: 9.37s
      • Custom storage
        – 12 GB of storage (in-memory binary search instead of index)
        – Ordered queries: 0.6s
        – Reverse ordered queries: 1.4s
        – Shuffled queries: 1.2s
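The "in-memory binary search instead of index" idea can be illustrated with Python's standard `bisect` module. This is a minimal sketch, assuming the trips are held in memory sorted by pickup time; the function name `trips_in_window` and the integer timestamps are hypothetical, and the authors' actual storage layout is not described on the slide.

```python
from bisect import bisect_left, bisect_right

def trips_in_window(pickup_times, start, end):
    """Return (lo, hi) such that pickup_times[lo:hi] holds every
    time t with start <= t <= end. pickup_times must be sorted."""
    lo = bisect_left(pickup_times, start)   # first position with t >= start
    hi = bisect_right(pickup_times, end)    # first position with t > end
    return lo, hi

# Made-up pickup times (e.g. seconds since some epoch), already sorted.
pickup_times = [5, 10, 10, 20, 35, 50]
lo, hi = trips_in_window(pickup_times, 10, 35)
window = pickup_times[lo:hi]
print(lo, hi, window)
```

Each query costs two binary searches, O(log n), and the matching trips are a contiguous slice, so no separate index structure has to be stored.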
    • Storage Solutions: Spatial-Temporal   All trips for a week in a given region   All trips in a week for a given taxi   All trips in a week for a given taxi in a given region Needs a complex indexing scheme that combines spatial, temporal, and taxi id searchesViDA Center Juliana Freire 13
    • Storage Solutions: Spatial-Temporal   SQLite   Custom storage –  20+10 GB of storage (ours) (index on time and –  12+4 GB of storage id, r-tree for (using (4d) kd-tree coordinates) on time, id and –  Creating indexes: coordinates) 52hrs –  Building kd-tree: 8 –  Range queries: 2.1s mins –  Combined queries: –  Range queries: 0.2s 15.3s –  Combined queries: –  Cross-table queries: 0.2s 57s –  Cross-table queries: 2sViDA Center Juliana Freire 14
    • Summary Statistics
      Analysis/Modeling
      • 13,237 Medallion Cabs
      • 42,000 Taxi Drivers
      • Average Number of Rides: 485k/day
      • Average Number of Passengers: 660k/day
      [Plot: Rides in 2011, Jan to Dec, y-axis from 29k to 590k; marked dates: Apr 2, Apr 3, Aug 28 (Irene), Dec 25]
    • Weekly Patterns
      Analysis/Modeling
      • Rides per Hour, June 2011: between 5k and 35k rides/hour
      • Night Life! Rides at Midnight
      [Plot: rides per hour, with 0h marked for each day]
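A rides-per-hour series like the one plotted on this slide amounts to bucketing pickup timestamps by day and hour. The sketch below assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS`, which is a guess about the raw files, and uses a handful of made-up pickups.

```python
from collections import Counter
from datetime import datetime

def rides_per_hour(pickup_timestamps):
    """Count rides in each (date, hour) bucket."""
    counts = Counter()
    for ts in pickup_timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        counts[(t.date().isoformat(), t.hour)] += 1
    return counts

# Made-up pickups spanning two nights in June 2011.
sample = [
    "2011-06-04 00:05:00",
    "2011-06-04 00:40:00",
    "2011-06-04 13:15:00",
    "2011-06-05 00:20:00",
]
hourly = rides_per_hour(sample)
# Rides in the midnight hour (0h) of each day.
midnight = {day: n for (day, hour), n in hourly.items() if hour == 0}
print(midnight)
```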
    • TLCVis
    • Drop-offs vs. Pickups
      Most of the drop-offs occur on the avenues, while most of the pickups occur on the streets
      [Map panels: Drop-off, Pickup]
    • Studying Anomalies
      Sunday, May 1st 2011: 4:00AM-4:30AM, 6:00AM-6:30AM, 8:00AM-8:30AM
    • Studying Anomalies
      Sunday, May 1st 2011: 8:00AM-8:30AM, 9:30AM-10:00AM
    • Studying Anomalies: Interpretation
      Sunday, May 1st 2011: 8:00AM-8:30AM, 9:30AM-10:00AM
      Five Borough Bike Tour
    • Studying Anomalies
      Sunday, May 1st 2011: 07:00AM-08:00AM
    • Studying Anomalies
      Sunday, May 1st 2011: 08:00AM-10:00AM
    • Studying Anomalies
      Sunday, May 1st 2011: 10:00AM-11:00AM
    • Studying Patterns
      May 1st – May 7th 2011, 3.6 Million Trips
      Compare movement in the airports against the large train stations
    • Studying Patterns
      May 1st – May 7th 2011, 3.6 Million Trips
      [Panels: Train Stations, Airports]
    • Data exploration reveals bad data…
    • Uses of Clean Data: FindMeACab App
    • Take Away
      • Data exploration is challenging for both small and big data
      • It is hard to prepare data for exploration
      • For many tasks, existing tools are too cumbersome, not scalable, etc.
      • Need better, usable tools
        – Tools for data enthusiasts who are not computer scientists!
      • Visualization is essential for exploring large volumes of data: "A picture is worth a thousand words"
      • Pictures help us think [Tamara Munzner]
        – Substitute perception for cognition
        – Free up limited cognitive/memory resources for higher-level problems
    • Masters in Big Data
      • New degree at NYU Poly – Spring 2014
      • Courses:
        – Machine learning
        – Massive data analysis
        – Visualization
        – Visual Analytics
        – Database Systems
        – Algorithms
        – …
    • Thanks