Delivered by Josh Katz (Graphics Editor, The New York Times) at the 2016 New York R Conference on April 8th and 9th at Work-Bench. See the rest of the conference videos & presentations at http://www.rstats.nyc.
6. NYC Taxi and Uber Data
• Taxi & Limousine Commission released
public, trip-level data for over 1.1 billion
taxi rides 2009–2015
• Some public Uber data available as
well, thanks to a FOIL request by
FiveThirtyEight
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
7. Citi Bike Data
• Citi Bike releases monthly data for every
individual ride
• Data includes timestamps and locations,
plus rider’s subscriber status, gender,
and age
https://www.citibikenyc.com/system-data
8. Generic Analysis Overview
1. Get raw data
2. Write code to process raw data into
something more useful
3. Analyze data
4. Write about what you found out
9. Analysis Tools
• PostgreSQL
• PostGIS
• R
• Command line
• JavaScript
https://github.com/toddwschneider/nyc-taxi-data
https://github.com/toddwschneider/nyc-citibike-data
10. Raw data processing goals
• Load flat files of varying file formats into a unified,
persistent PostgreSQL database that we can use to
answer questions about the data
• Do some one-time calculations to augment the raw
data
• We want to answer neighborhood-based questions,
so we’ll map latitude/longitude coordinates to NYC
census tracts
11. Processing raw data:
The reality
• Raw data is often messy and can require massaging
• Not fun, and it takes a while, but it's essential
• Specifically: we have to plan ahead a bit,
anticipate usage patterns and the questions we're
going to ask, then decide on a schema
13. Specific issues encountered
with raw taxi data
• Some files contain empty lines and unquoted carriage
returns 😐
• Raw data files have different formats even within the
same cab type 😕
• Some files contain extra columns in every row 😠
• Some files contain extra columns in only some rows 😡
14. How do we load a bunch of
files into a database?
• One at a time!
• Bash script loops through each raw data file, for
each file it executes code to process data and
insert records into a database table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
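The loop can be sketched roughly like this. The directory, database, and table names below are placeholders, and the cleanup step mirrors the issues from the previous slide; the actual import logic lives in the import_trip_data.sh script linked above. The `psql` line is shown but commented out, since it needs a live database:

```shell
#!/bin/bash
# Sketch of the one-file-at-a-time load loop.
# Generate a sample file with the kinds of junk slide 13 describes:
# stray carriage returns and an empty line.
mkdir -p data
printf 'a,b\r\n\n1,2\r\n' > data/sample.csv

for f in data/*.csv; do
  # strip carriage returns and empty lines before loading
  tr -d '\r' < "$f" | grep -v '^$' > "${f%.csv}_clean.csv"
  # psql nyc-taxi-data -c "\copy trips_staging FROM ${f%.csv}_clean.csv CSV"
done

cat data/sample_clean.csv   # prints the cleaned rows
```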
15. How do we map latitude and
longitude to census tracts?
• PostGIS!
• Geographic information system (GIS) for
PostgreSQL
• Can do calculations of the form, “is a point inside a
polygon?”
• Every pickup/drop off is a point, NYC’s census
tracts are polygons
17. Shapefiles
• Shapefile format describes geometries like points,
lines, polygons
• Many shapefiles publicly available, e.g. NYC
provides a shapefile that contains definitions for all
census tracts and NTAs
• PostGIS includes functionality to import shapefiles
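For example, PostGIS ships with the shp2pgsql command-line tool, which converts a shapefile to SQL for piping into psql. A dry-run sketch (the file name, SRID, and database name are assumptions; the command is echoed rather than executed here because it needs a live database):

```shell
#!/bin/bash
# Build the load command; -s sets the spatial reference ID.
SHAPEFILE=nyct2010.shp   # hypothetical path to the census tract shapefile
SRID=2263                # NY State Plane (Long Island), common for NYC data
echo "shp2pgsql -s $SRID $SHAPEFILE nyct2010 | psql nyc-taxi-data"
```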
19. PostGIS: ST_Within()
• ST_Within(geom A, geom B) function returns
true if and only if A is entirely within B
• A = pickup or drop off point
• B = NYC census tract polygon
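As a rough sketch, the tract assignment looks like this (the geometry column and tract table names are assumptions for illustration, not necessarily the repo's exact schema):

```sql
-- Tag each pickup with the census tract polygon that contains it.
-- trips.pickup_geom and nyct2010 are illustrative names.
UPDATE trips
SET pickup_nyct2010_gid = n.gid
FROM nyct2010 n
WHERE ST_Within(trips.pickup_geom, n.geom);
```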
20. Spatial Indexes
• Problem: determining whether a point is inside an
arbitrary polygon is computationally intensive and
slow
• PostGIS spatial indexes to the rescue!
22. Spatial Indexes
• Determining whether a point is inside a rectangle is easy!
• Spatial indexes store rectangular bounding boxes for
polygons, then when determining if a point is inside a
polygon, calculate in 2 steps:
1. Is the point inside the polygon’s bounding box?
2. If so, is the point inside the polygon itself?
• Most of the time the cheap first check will be false, then
we can skip the expensive second step
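In PostGIS the bounding-box index described above is a GiST index, and creating one is a one-liner (table and column names assumed):

```sql
-- The GiST index stores rectangular bounding boxes for the polygons;
-- ST_Within's implicit bounding-box check uses it automatically.
CREATE INDEX idx_nyct2010_geom ON nyct2010 USING gist (geom);
```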
23. Putting it all together
• Download NYC census tracts shapefile, import
into database, create spatial index
• Download raw taxi/Uber/Citi Bike data files and
loop through them, one file at a time
• For each file: fix data issues, load into database,
calculate census tracts with ST_Within()
• Wait 3 days and voilà!
24. Analysis, a.k.a. “the fun part”
• Ask fun and interesting
questions
• Try to answer them
• Rinse and repeat
25. Taxi maps
• Question: what does a map of every taxi pickup
and drop off look like?
• Each trip has a pickup and drop off location, plot a
bunch of dots at those locations
• Made entirely in R using ggplot2
27. Taxi maps preprocess
• Problem: R can’t fit 1.1 billion rows in memory
• Solution: preprocess data by rounding
lat/long to 4 decimal places (~10
meters), count number of trips at each
aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
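The rounding step might look roughly like this (column names are assumptions; the exact query is at the link above):

```sql
-- Collapse 1.1 billion trips to one row per ~10 m grid cell.
SELECT ROUND(pickup_longitude::numeric, 4) AS lng,
       ROUND(pickup_latitude::numeric, 4)  AS lat,
       COUNT(*) AS num_trips
FROM trips
GROUP BY lng, lat;
```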
28. Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R
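A minimal sketch of the ggplot2 rendering, assuming the aggregated points from the previous step sit in a data frame `pickups` with `lng`, `lat`, and `num_trips` columns (the real plotting code is at the link above):

```r
library(ggplot2)

# One dot per aggregated point; log-scaled alpha so busy
# areas glow without washing out quieter ones
ggplot(pickups, aes(x = lng, y = lat, alpha = num_trips)) +
  geom_point(size = 0.05, color = "white") +
  scale_alpha_continuous(trans = "log", range = c(0.01, 1)) +
  coord_quickmap() +
  theme_void() +
  theme(panel.background = element_rect(fill = "black"),
        legend.position = "none")
```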
30. • Map the position of every Citi Bike over
the course of a single day
• Google Maps Directions API for cycling
directions
• Leaflet.js for mapping
• Torque.js by CartoDB for animation
Citi Bike Animation
31. • Google Maps cycling directions
have a strong bias toward dedicated
bike lanes on 1st, 2nd, 8th, and
9th avenues
• Not necessarily true!
Citi Bike Assumptions
33. Modeling the relationship between
the weather and Citi Bike ridership
• Daily ridership data from Citi Bike
• Daily weather data from National Climatic
Data Center: temperature, precipitation,
snow depth
• Devise and calibrate model in R
36. Calibration in R
• Uses the nlsLM() function from the
minpack.lm package, which implements the
Levenberg–Marquardt algorithm for
nonlinear least squares
https://gist.github.com/toddwschneider/bac3350f84b2ff99969d
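A toy example of the calibration pattern, with a made-up logistic relationship between temperature and ridership (the model form, data, and starting values are all illustrative; the real model is in the linked gist):

```r
library(minpack.lm)

# fake daily data: ridership rises with temperature toward a plateau
set.seed(1)
temp  <- 10:90
rides <- 40000 / (1 + exp(-(temp - 50) / 10)) + rnorm(length(temp), sd = 1000)

# Levenberg–Marquardt nonlinear least squares fit
fit <- nlsLM(rides ~ k / (1 + exp(-(temp - t0) / s)),
             start = list(k = 30000, t0 = 40, s = 5))
summary(fit)
```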
38. Airport traffic
• Question: how long will my taxi take to get to the
airport?
• LGA, JFK, and EWR are each their own census
tracts
• Get all trips that dropped off in one of those tracts
• Calculate travel times from neighborhoods to airports
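Sketched as a query (the gid value is a placeholder for JFK's tract; identifiers otherwise follow the repo's naming loosely):

```sql
-- Average door-to-door time to the airport, by pickup tract.
SELECT pickup_nyct2010_gid,
       AVG(dropoff_datetime - pickup_datetime) AS avg_trip_time
FROM trips
WHERE dropoff_nyct2010_gid = 2100  -- hypothetical gid of JFK's tract
GROUP BY pickup_nyct2010_gid;
```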
40. More fun stuff in the full posts
• On the realism of Die Hard 3
• Relationship between age, gender, and cycling
speed
• Neighborhoods with most nightlife
• East Hampton privacy concerns
• What time do investment bankers arrive at work?
42. What is “medium data”?
No clear answer, but my rough thinking:
• Tiny: fits in spreadsheet
• Small: doesn’t fit in spreadsheet, but fits in RAM
• Medium: too big for RAM, but fits on local hard disk
• Big: too big for local disk, has to be distributed across
many nodes
43. Use the right tool for the job
My personal toolkit (yours may vary!):
• PostgreSQL for storing and aggregating data. Geospatial
calculations with PostGIS extension
• R for modeling and plotting
• Command line tools for looping through files, loading data, text
processing on input data with sed, awk, etc.
• Ruby for making API calls, scraping websites, running web servers,
and sometimes using local rails apps to organize relational data
• JavaScript for interactivity on the web
44. R + PostgreSQL
• The R ↔ Postgres link is invaluable! Use R and
Postgres for the things they’re respectively best at
• Postgres: persisting data in tables, rote number
crunching
• R: calibrating models, plotting
• RPostgreSQL package allows querying Postgres
from within R
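The round trip can be sketched like this (the database name is a placeholder):

```r
library(RPostgreSQL)

# let Postgres do the rote aggregation...
con    <- dbConnect(PostgreSQL(), dbname = "nyc-taxi-data")
hourly <- dbGetQuery(con, "
  SELECT pickup_hour, SUM(count) AS trips
  FROM hourly_pickups
  GROUP BY pickup_hour
")
dbDisconnect(con)

# ...then model and plot the small result in R
plot(hourly$pickup_hour, hourly$trips, type = "l")
```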
45. Tip: pre-aggregate
• Think about how you’re going to access the data, and
consider creating intermediate aggregated tables
which can be used as building blocks for later
analysis
• Example: number of taxi trips grouped by pickup
census tract and date/time truncated to the hour
• Resulting table is only 30 million rows, easier to work
with than full trips table, and can still answer lots of
interesting questions
46. Pre-aggregating example
CREATE TABLE hourly_pickups AS
SELECT
date_trunc('hour', pickup_datetime) AS pickup_hour,
cab_type_id,
pickup_nyct2010_gid,
COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour,
cab_type_id,
pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
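Once the aggregate exists, questions like “how do pickups vary by hour of day in a given tract?” become cheap (the gid is a placeholder):

```sql
SELECT EXTRACT(hour FROM pickup_hour) AS hour_of_day,
       SUM(count) AS trips
FROM hourly_pickups
WHERE pickup_nyct2010_gid = 2100  -- hypothetical tract id
GROUP BY hour_of_day
ORDER BY hour_of_day;
```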
47. How to get people to read
your work
• It has to be interesting. If you’re not excited,
probably nobody else is either
• Most people are distracted, and they read things in
“fast scroll” mode. Optimize for them
• The questions you ask are more important than
the methods you use to answer them
48. Specific tips
• Write in short paragraphs with straightforward
language
• Use plenty of section headers
• Good ratio of pictures to text
• Avoid the dreaded “wall of text”
49. Above all…
• Have fun!
• Keep an inquisitive mind.
Observe stuff happening around
you, ask questions about it, try to
answer those questions