The document discusses prototyping location apps with real data. It describes generating realistic datasets of people moving around cities by gathering check-in data from Foursquare tweets and visualizing the check-ins on maps. It also discusses generating social networks by extracting people and connection data from Wikipedia and DBpedia, including types of entities and links between pages. Code examples are provided to load and filter this data using Pig scripts on Amazon EMR.
2. Whether you're a new startup looking for investment, or a team at a large company who wants the green light for a new product,
nothing convinces like real running code. But how do you solve the chicken-and-egg problem of filling your early prototype with
real data?
Traffic Photo by TheTruthAbout - http://flic.kr/p/59kPoK
Money Photo by borman818 - http://flic.kr/p/61LYTT
3. As experts in processing large datasets and interpreting charts and graphs, we may think of our data in the same way that a
Bloomberg terminal presents financial information. But information visualisation alone does not make a product.
http://www.flickr.com/photos/financemuseum/2200062668/
4. We need to communicate our understanding of the data to the rest of our product team. We need to be their eyes and ears in the
data - translating human questions into code, and query results into human answers.
5. prototypes are
boundary objects
Instead of communicating across disciplines using language from our own specialisms, we show what we mean in real running
code and designs. We prototype as early as possible, so that we can talk in the language of the product.
http://en.wikipedia.org/wiki/Boundary_object - “allow coordination without consensus as they can allow an actor's local
understanding to be reframed in the context of a some wider collective activity”
http://www.flickr.com/photos/orinrobertjohn/159744546/
6. Prototyping has many potential benefits. We use this triangle to think about how to structure our work and make it clear what
insights we are looking for in a particular project.
7-10. [Diagram: a triangle whose corners are labelled Novelty, Fidelity and Desirability.]
11. no more
lorem ipsum
By incorporating analysis and data-science into product design during the prototyping phase, we avoid “lorem ipsum”, the fake
text and made-up data that is often used as a placeholder in design sketches. This helps us understand real-world product use
and find problems earlier.
Photo by R.B. - http://flic.kr/p/8APoN4
12. helping designers explore data
Data can be complex. One of the first things we do when working with a new dataset is create internal toys - “data
explorers” - to help us understand it.
13. Philip Kromer, Infochimps
Flip Kromer of Infochimps describes this process as “hitting the data with the Insight Stick.”
As data scientists, one of our common tasks is to take data from almost any source and apply standard structural techniques to
it without worrying too much about the domain of the data.
15. “With enough data you can discover patterns and facts using simple counting that you can't discover in small data using sophisticated statistical and ML approaches.”
–Dmitriy Ryaboy paraphrasing Peter Norvig on Quora
http://b.qr.ae/ijdb2G
16. Here’s a small example of exploring a dataset that I did while working in Nokia’s Location & Commerce division.
17. Searches are goal-driven user behaviour - someone typed something into a search box on a phone. But we can even learn from
activity that isn’t so explicit.
When someone views a Nokia Ovi map on the web or phone, the visuals for the map are served up in square “tiles” from our
servers. We can analyse the number of requests made for each tile and take it as a measure of interest or attention in that part of
the world.
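As a rough illustration of the tile-counting idea, here is a minimal Python sketch. The log format (a path of the form `/tiles/<zoom>/<x>/<y>.png`) and the name `count_tile_requests` are assumptions made for this example; the talk does not describe the real layout of the map-tile server logs.

```python
import re
from collections import Counter

# Hypothetical access-log format: each line contains a path such as
# "/tiles/<zoom>/<x>/<y>.png". The real server log layout is not
# described in the talk.
TILE_PATH = re.compile(r"/tiles/(\d+)/(\d+)/(\d+)\.png")


def count_tile_requests(log_lines):
    """Return a Counter mapping (zoom, x, y) to the number of requests."""
    counts = Counter()
    for line in log_lines:
        match = TILE_PATH.search(line)
        if match:
            zoom, x, y = (int(group) for group in match.groups())
            counts[(zoom, x, y)] += 1
    return counts


sample_log = [
    "GET /tiles/12/703/1635.png 200",
    "GET /tiles/12/703/1635.png 200",
    "GET /tiles/12/704/1635.png 200",
    "GET /favicon.ico 404",  # not a tile request, so it is ignored
]
tile_counts = count_tile_requests(sample_log)
```

The resulting counts per tile are the kind of raw "attention" measure that could then be drawn as a heatmap.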
20. LA attention heatmap
We built a tool that could calculate metrics for every grid-square of the map of the world, and present heatmaps of
that data on a city level. This view shows which map-tiles are viewed most often in LA using Ovi Maps. It’s calculated
from the server logs of our map-tile servers. You could think of it as a map of the attention our users give to each
tile of LA.
21. LA driving heatmap
This is the same area of California, but instead of map-tile attention it shows the relative number of cars on the road that are
using our navigation features. This gives a whole different view on the city. We can see that it highlights major roads, and it’s
much harder to see where the US coastline occurs. By comparing these two heatmaps we start to understand the meaning and
the potential of these two datasets.
22. But of course a heatmap alone isn’t a product. This is one of the visualisation sketches produced by designer Tom
Coates after investigating the data using the heatmap explorer. It’s much closer to something that could go into a
real product.
23. Tools
These are the tools I’ll be using to demo some of my working processes.
25. Apache Pig makes Hadoop much easier to use by creating map-reduce plans from SQL-like scripts.
32. Realistic cities
generating a dataset of people
moving around town
The first dataset we’ll generate is one you could use to test any system or app involving people moving around the
world - whether it’s an ad-targeting system or a social network.
33. You probably know about Stamen’s beautiful work creating new renderings of OpenStreetMap, including this Toner
style.
34. When they were getting ready to launch their newest tiles called Watercolor, they created this rendering of the access
logs from their Toner tileservers. It shows which parts of the map are most viewed by users of Toner-based apps.
35. Working with data and inspiration from Eric Fischer, Nathaniel Kelso of Stamen generated this map to decide how
deep to pre-render each area of the world to get the maximum hit-rate on their servers. Rendering the full map to
the deepest zoom would have taken years on their servers. The data used as a proxy for the attention of users is a
massive capture of geocoded tweets. The more tweets per square mile, the deeper the zoom will be rendered in that
area.
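The density-to-depth idea might be sketched as follows. This is not Stamen's actual method: the tile formula is the common OpenStreetMap "slippy map" convention, and the rule of one extra zoom level per factor of ten in tweet count, along with the names `latlng_to_tile` and `render_depths`, are assumptions chosen only to illustrate the approach. It also counts tweets per tile rather than per square mile, which is a simplification because tile area varies with latitude.

```python
import math
from collections import Counter


def latlng_to_tile(lat, lng, zoom):
    """Convert a WGS84 lat/lng to slippy-map tile (x, y) at a zoom level."""
    n = 2 ** zoom
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def render_depths(points, base_zoom=10, min_depth=12, max_depth=18):
    """Pick a maximum pre-render zoom for each base tile from point density."""
    counts = Counter(latlng_to_tile(lat, lng, base_zoom) for lat, lng in points)
    depths = {}
    for tile, count in counts.items():
        # Hypothetical rule: one extra zoom level per factor of 10 in tweets.
        depth = min_depth + int(math.log10(count))
        depths[tile] = min(depth, max_depth)
    return depths


# 100 geocoded tweets in San Francisco and a single tweet elsewhere
points = [(37.7749, -122.4194)] * 100 + [(37.0, -120.0)]
depths = render_depths(points)
```

Busy tiles end up with a deeper render depth than quiet ones, which is the trade-off between hit-rate and rendering time described above.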
36. We can go further than geocoded tweets and get a realistic set of POIs that people go to, with timestamps. If you
search for 4sq on the Twitter streaming API you get about 25,000 tweets per hour announcing users’ Foursquare
checkins.
39. And if you view source, the data’s all there in JSON format.
40. Demo:
Gathering Foursquare tweets
So I set up a script to skim the tweets, perform the HTTP requests on 4sq.com and capture the tweet+checkin data as
lines of JSON in files in S3.
41. For this demo I wanted to show just people in San Francisco so I looked up a bounding-box for San Francisco.
42. DEFINE json2tsv `json2tsv.rb` SHIP('/home/hadoop/pig/json2tsv.rb','/home/hadoop/pig/json.tar');
A = LOAD 's3://mattb-4sq';
B = STREAM A THROUGH json2tsv AS (lat:float, lng:float,
    venue, nick, created_at, tweet);
SF = FILTER B BY lat > 37.604031 AND lat < 37.832371 AND
    lng > -123.013657 AND lng < -122.355301;
PEOPLE = GROUP SF BY nick;
PEOPLE_COUNTED = FOREACH PEOPLE GENERATE
    COUNT(SF) AS c, group, SF;
ACTIVE = FILTER PEOPLE_COUNTED BY c >= 5;
RESULT = FOREACH ACTIVE GENERATE
    group, FLATTEN(SF);
STORE RESULT INTO 's3://mattb-4sq/active-sf';
This pig script loads up the JSON and streams it through a ruby script to turn JSON into Tab-Separated data (because
it’s easier to deal with in pig than JSON).
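The JSON-to-TSV step can be sketched in a few lines. This is a Python stand-in, not the json2tsv.rb used in the workshop: the field names come from the Pig schema, while the flat JSON key layout and the sample record are assumptions made for illustration.

```python
import json

# Column order follows the Pig schema; the flat key layout is an assumption.
FIELDS = ("lat", "lng", "venue", "nick", "created_at", "tweet")

def json_to_tsv(line):
    """Turn one JSON record into a tab-separated line, or None if unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    values = []
    for name in FIELDS:
        value = record.get(name, "")
        # Tabs or newlines inside a tweet would break the TSV layout.
        values.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(values)

# Hypothetical record, built with json.dumps so the escaping is valid JSON.
sample = json.dumps({
    "lat": 37.7749, "lng": -122.4194, "venue": "Example Cafe",
    "nick": "alice", "created_at": "2012-04-02T10:00:00Z",
    "tweet": "I'm at Example Cafe\thttp://4sq.com/example",
})
print(json_to_tsv(sample))
```

A script used with Pig's STREAM would read records from standard input and print one converted line per record.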
43. We filter the data to San Francisco lat-longs, group the data by username and count it. Then we keep only “active”
users - people with at least 5 checkins.
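For small samples the same filter, group and count logic can be tried locally before running it on a cluster. A minimal sketch, assuming check-ins are already parsed into (nick, lat, lng) tuples; the bounding box is the one from the Pig script, and the users and coordinates are made up.

```python
from collections import defaultdict

# Bounding box for San Francisco, as used in the Pig script.
SF_BOX = {"min_lat": 37.604031, "max_lat": 37.832371,
          "min_lng": -123.013657, "max_lng": -122.355301}

def in_box(lat, lng, box=SF_BOX):
    """True when the point lies strictly inside the bounding box."""
    return (box["min_lat"] < lat < box["max_lat"]
            and box["min_lng"] < lng < box["max_lng"])

def active_users(checkins, min_checkins=5):
    """Group in-box check-ins by nick and keep users with enough of them."""
    by_nick = defaultdict(list)
    for nick, lat, lng in checkins:
        if in_box(lat, lng):
            by_nick[nick].append((lat, lng))
    return {nick: rows for nick, rows in by_nick.items()
            if len(rows) >= min_checkins}

# Hypothetical data: alice has 5 check-ins in the box, bob only 4,
# and carol has 6 but all of them outside the box.
checkins = ([("alice", 37.78, -122.41)] * 5
            + [("bob", 37.78, -122.41)] * 4
            + [("carol", 40.71, -74.00)] * 6)
print(sorted(active_users(checkins)))
```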
44. Demo:
Visualising checkins with GeoJSON and KML
You can view the path of one individual user as they arrive at SFO and get their rental car at
http://maps.google.com/maps?q=http:%2F%2Fwww.hackdiary.com%2Fmisc%2Fsampledata-broton.kml&hl=en&ll=37.625585,-122.398124&spn=0.018015,0.040169&sll=37.0625,-95.677068&sspn=36.863178,82.265625&t=m&z=15&iwloc=lyrftr:kml:cFxADtCtq9UxFii5poF9Dk7kA_B4QPBI,g475427abe3071143,,
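One user's path can be expressed as a GeoJSON LineString. This is a sketch of the idea only, not the workshop's own conversion code; the timestamps and coordinates are invented, and ISO-8601 timestamps are assumed so that sorting the strings gives chronological order.

```python
import json

def path_to_geojson(nick, checkins):
    """Build a GeoJSON Feature for one user's path.

    checkins: list of (created_at, lat, lng) tuples with ISO-8601 timestamps.
    """
    ordered = sorted(checkins)  # chronological, thanks to ISO-8601 strings
    return {
        "type": "Feature",
        "properties": {"nick": nick},
        "geometry": {
            "type": "LineString",
            # GeoJSON positions are [longitude, latitude], in that order.
            "coordinates": [[lng, lat] for _, lat, lng in ordered],
        },
    }

# Hypothetical check-ins near the airport, deliberately out of order.
sample = [
    ("2012-04-02T10:20:00Z", 37.6290, -122.4010),
    ("2012-04-02T10:00:00Z", 37.6213, -122.3790),
]
feature = path_to_geojson("example_user", sample)
print(json.dumps(feature))
```

Note the coordinate order: GeoJSON puts longitude first, which is the reverse of the lat, lng order in the Pig schema.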
46. Realistic social networks
generating a dataset of social
connections between people
What about the connections between people? What data could we use as a proxy for a large social graph?
47. Wikipedia is full of data about people and the connections between them.
48. The DBpedia project extracts just the metadata from Wikipedia - the types, the links, the geo-coordinates etc.
50. It’s available as a public dataset that you can attach to an Amazon EC2 instance and look through.
51. There are many kinds of data in separate files (you can also choose your language).
52. We’re going to start with this one. It tells us what “types” each entity is on Wikipedia, parsed out from the
Infoboxes on their pages.
54. <Autism> <type> <dbpedia.org/ontology/Disease>
<Autism> <type> <www.w3.org/2002/07/owl#Thing>
<Aristotle> <type> <dbpedia.org/ontology/Philosopher>
<Aristotle> <type> <dbpedia.org/ontology/Person>
<Aristotle> <type> <www.w3.org/2002/07/owl#Thing>
<Aristotle> <type> <xmlns.com/foaf/0.1/Person>
<Aristotle> <type> <schema.org/Person>
<Bill_Clinton> <type> <dbpedia.org/ontology/OfficeHolder>
<Bill_Clinton> <type> <dbpedia.org/ontology/Person>
<Bill_Clinton> <type> <www.w3.org/2002/07/owl#Thing>
<Bill_Clinton> <type> <xmlns.com/foaf/0.1/Person>
<Bill_Clinton> <type> <schema.org/Person>
And these are the ones we’re going to need; just the people.
56. Then we’ll take the file that shows which pages link to which other Wikipedia pages.
57. <http://dbpedia.org/resource/Bill_Clinton> -> Woody_Freeman
<http://dbpedia.org/resource/Bill_Clinton> -> Yasser_Arafat
<http://dbpedia.org/resource/Bill_Dodd> -> Bill_Clinton
<http://dbpedia.org/resource/Bill_Frist> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Dylan> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Graham> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Hope> -> Bill_Clinton
And we’ll try to filter it down to just the human relationships.
58. TYPES = LOAD 's3://mattb/instance_types_en.nt.bz2' USING
    PigStorage(' ') AS (subj, pred, obj, dot);
PEOPLE_TYPES = FILTER TYPES BY obj == '<http://xmlns.com/foaf/0.1/Person>';
PEOPLE = FOREACH PEOPLE_TYPES GENERATE subj;
LINKS = LOAD 's3://mattb/page_links_en.nt.bz2' USING
    PigStorage(' ') AS (subj, pred, obj, dot);
SUBJ_LINKS_CO = COGROUP PEOPLE BY subj, LINKS BY subj;
SUBJ_LINKS_FILTERED = FILTER SUBJ_LINKS_CO BY NOT
    IsEmpty(PEOPLE) AND NOT IsEmpty(LINKS);
SUBJ_LINKS = FOREACH SUBJ_LINKS_FILTERED GENERATE
    FLATTEN(LINKS);
Using pig we load up the types file and filter it to just the people (the entities of type Person from the FOAF
ontology).
59. We filter the links to only those whose subject (originating page) is a person.
60. OBJ_LINKS_CO = COGROUP PEOPLE BY subj, SUBJ_LINKS BY obj;
OBJ_LINKS_FILTERED = FILTER OBJ_LINKS_CO BY NOT
IsEmpty(PEOPLE) AND NOT IsEmpty(SUBJ_LINKS);
OBJ_LINKS = FOREACH OBJ_LINKS_FILTERED GENERATE
FLATTEN(SUBJ_LINKS);
D_LINKS = DISTINCT OBJ_LINKS;
STORE D_LINKS INTO 's3://mattb/people-graph' USING
PigStorage(' ');
And then filter again to only those links that link to a person.
61. ... and store it.
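On a sample small enough to fit in memory, the two COGROUP filters and the DISTINCT reduce to set membership tests. A minimal sketch, assuming space-separated triples as in the Pig script; the `<type>` and `<link>` predicates are placeholders, not the real predicate URIs in the DBpedia files.

```python
FOAF_PERSON = "<http://xmlns.com/foaf/0.1/Person>"

def parse_triple(line):
    """Split a line on spaces, mirroring PigStorage(' ') in the script."""
    parts = line.strip().split(" ")
    if len(parts) < 3:
        return None
    return parts[0], parts[1], parts[2]

def people_graph(type_lines, link_lines):
    """Return the distinct links whose subject and object are both people."""
    people = set()
    for line in type_lines:
        triple = parse_triple(line)
        if triple and triple[2] == FOAF_PERSON:
            people.add(triple[0])
    edges = set()  # a set makes the result distinct
    for line in link_lines:
        triple = parse_triple(line)
        if triple and triple[0] in people and triple[2] in people:
            edges.add((triple[0], triple[2]))
    return sorted(edges)

R = "<http://dbpedia.org/resource/"
# Sample triples with placeholder predicates.
types = [
    R + "Bill_Clinton> <type> " + FOAF_PERSON + " .",
    R + "Bob_Dylan> <type> " + FOAF_PERSON + " .",
    R + "Autism> <type> <http://dbpedia.org/ontology/Disease> .",
]
links = [
    R + "Bob_Dylan> <link> " + R + "Bill_Clinton> .",
    R + "Bob_Dylan> <link> " + R + "Autism> .",
    R + "Autism> <link> " + R + "Bill_Clinton> .",
    R + "Bob_Dylan> <link> " + R + "Bill_Clinton> .",  # duplicate link
]
print(people_graph(types, links))
```

The in-memory set only works while the list of people fits on one machine, which is why the full dataset goes through COGROUP on a cluster instead.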
62. <http://dbpedia.org/resource/Bill_Clinton> -> Woody_Freeman
<http://dbpedia.org/resource/Bill_Clinton> -> Yasser_Arafat
<http://dbpedia.org/resource/Bill_Dodd> -> Bill_Clinton
<http://dbpedia.org/resource/Bill_Frist> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Dylan> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Graham> -> Bill_Clinton
<http://dbpedia.org/resource/Bob_Hope> -> Bill_Clinton
This is the result in text.
64. Colours show the results of a “Modularity” analysis that finds the clusters of communities within the graph. For
example, the large cyan group containing Barack Obama is all government and royalty.
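To give a feel for what a modularity analysis optimises, the score itself is easy to compute for a given grouping. The sketch below uses the standard Newman modularity formula for an undirected, unweighted graph; it only scores a partition and does not search for one, and it is not necessarily what the visualisation tool runs internally. The toy graph is invented.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # summed degree of the community's nodes
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(edges, two_groups), 3))
print(round(modularity(edges, one_group), 3))
```

Splitting the toy graph into its two triangles scores higher than lumping every node together, which is the signal a community-detection algorithm searches for.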
67. This is a great book that goes into these techniques in depth. However it’s useful for any networked data, not just
social networks. And it’s useful to anyone, not just startups.
70. Realistic ranking
generating a dataset of places
ordered by importance
What if we have all this data about people, places or things but we don’t know whether one thing is more important
than another? We can use public data to rank, compare and score.
71. Wikipedia makes hourly summaries of their web traffic available. Each line of each file shows the language and name
of a page on Wikipedia and how many times it was accessed that hour. We can use that attention as a proxy for the
importance of concepts.
76. Van_Ness_Avenue_%28San_Francisco%29
Recreation_Park_%28San_Francisco%29
Broadway_Tunnel_%28San_Francisco%29
Broadway_Street_%28San_Francisco%29
Carville,_San_Francisco
Union_League_Golf_and_Country_Club_of_San_Francisco
Ambassador_Hotel_%28San_Francisco%29
Columbus_Avenue_%28San_Francisco%29
Grand_Hyatt_San_Francisco
Marina_District,_San_Francisco
Pier_70,_San_Francisco
Victoria_Theatre,_San_Francisco
San_Francisco_Glacier
San_Francisco_de_Ravacayco_District
San_Francisco_church
Lafayette_Park,_San_Francisco,_California
Antioch_University_%28San_Francisco%29
San_Francisco_de_Chiu_Chiu
... which looks like this. There are over 400,000 of them.
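The names in this list are percent-encoded and use underscores for spaces. If a readable label is needed for display, a small helper like the following would do; this is an illustrative sketch, and the function name is made up.

```python
from urllib.parse import unquote

def readable_title(page_name):
    """Decode %XX escapes and turn underscores into spaces."""
    return unquote(page_name).replace("_", " ")

print(readable_title("Van_Ness_Avenue_%28San_Francisco%29"))
print(readable_title("Marina_District,_San_Francisco"))
```

For joining against the traffic statistics it is safer to keep the encoded form, since that is the key the Pig script matches on.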
77. DATA = LOAD 's3://wikipedia-stats/*.gz' USING
    PigStorage(' ') AS (lang, name, count:int, other);
ENDATA = FILTER DATA BY lang=='en';
FEATURES = LOAD 's3://wikipedia-stats/features.txt'
    USING PigStorage(' ') AS (feature);
FEATURE_CO = COGROUP ENDATA BY name,
    FEATURES BY feature;
FEATURE_FILTERED = FILTER FEATURE_CO BY NOT
    IsEmpty(FEATURES) AND NOT IsEmpty(ENDATA);
FEATURE_DATA = FOREACH FEATURE_FILTERED
    GENERATE FLATTEN(ENDATA);
Using pig we filter the page traffic stats to just the English hits.
78. We filter the entities down to just those that are geo-features.
79. NAMES = GROUP FEATURE_DATA BY name;
COUNTS = FOREACH NAMES GENERATE group,
    SUM(FEATURE_DATA.count) AS c;
FCOUNT = FILTER COUNTS BY c > 500;
SORTED = ORDER FCOUNT BY c DESC;
STORE SORTED INTO 's3://wikipedia-stats/features_out.gz' USING PigStorage('\t');
We group and sum the statistics by page-name, keep the pages with more than 500 hits and sort them by count.
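The whole ranking pipeline can be tested on a handful of lines before it is run over a month of logs. A minimal sketch, assuming the four space-separated columns from the Pig schema (lang, name, count, other); the sample lines and the feature set are made up.

```python
from collections import Counter

def rank_features(stat_lines, features, min_hits=500):
    """Sum English hits per feature page, keep totals above min_hits,
    and return (name, total) pairs sorted by total, highest first."""
    totals = Counter()
    for line in stat_lines:
        parts = line.strip().split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        lang, name, count, _other = parts
        if lang != "en" or name not in features:
            continue
        try:
            totals[name] += int(count)
        except ValueError:
            continue  # skip lines with a non-numeric count
    ranked = [(name, total) for name, total in totals.items()
              if total > min_hits]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked

# Hypothetical hourly lines: lang, page name, hit count, bytes.
lines = [
    "en Chinatown 400 1000",
    "en Chinatown 200 1000",
    "de Chinatown 900 1000",   # not English, ignored
    "en Nob_Hill 300 1000",    # total stays below the threshold
    "en Main_Page 9999 1000",  # not a geo-feature, ignored
]
features = {"Chinatown", "Nob_Hill"}
print(rank_features(lines, features))
```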
80. Successfully read 442775 records from:
"s3://wikipedia-stats/features.txt"
Successfully read 975017055 records from:
"s3://wikipedia-stats/pagecounts-2012012*.gz"
in 4 hours, 19 minutes and 32 seconds
using 4 m1.small instances.
Using a 4-machine Elastic MapReduce cluster I can process 50 GB of data containing nearly a billion rows in a little
over four hours.
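The job log gives enough to estimate throughput: roughly 62,600 records per second across the cluster, or about 15,650 per instance. The arithmetic, using only the figures from the log:

```python
# Figures taken from the job log.
records = 975_017_055
elapsed_seconds = 4 * 3600 + 19 * 60 + 32  # 4 h 19 min 32 s
instances = 4

per_second = records / elapsed_seconds
per_instance = per_second / instances

print(f"{elapsed_seconds} seconds elapsed")
print(f"{per_second:,.0f} records per second overall")
print(f"{per_instance:,.0f} records per second per instance")
```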
81. San Francisco neighborhoods
The Castro 2479
Chinatown 2457
Tenderloin 2276
Mission District 1336
Union Square 1283
Nob Hill 952
Bayview-Hunters Point 916
Alamo Square 768
Russian Hill 721
Ocean Beach 661
Pacific Heights 592
Sunset District 573
Here are some results. As you’d expect, the neighbourhoods that rank the highest are the most famous ones. Local
residential neighbourhoods come lower down the scale.
82. London neighbourhoods
Hackney 3428
Camden 2498
Tower Hamlets 2378
Newham 1850
Enfield 1830
Croydon 1796
Islington 1624
Southwark 1603
Lambeth 1354
Greenwich 1316
Hammersmith and Fulham 1268
Haringey 1263
Harrow 1183
Brent 1140
Here it is again for London.
83. To demo this ranking in a data toy that anyone can play with, I built an auto-completer using Elasticsearch. I
transformed the pig output into JSON and made an index.
84. Demo:
A weighted autocompleter with Elasticsearch
I exposed this index through a small Ruby webapp written in Sinatra.
85. So we can easily answer questions like “which of the world’s many Chinatown districts are the best-known?”
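The idea behind a weighted autocompleter can be shown without a search server: match on the prefix, then order by weight. The sketch below is an in-memory stand-in, not the Elasticsearch index or the Sinatra app from the demo; the page names and counts are invented, and the input is assumed to be tab-separated name and count columns like the Pig output.

```python
def parse_ranked(lines):
    """Turn 'name<TAB>count' lines into documents with a numeric weight."""
    docs = []
    for line in lines:
        name, _, count = line.rstrip("\n").rpartition("\t")
        if name and count.isdigit():
            docs.append({"name": name, "weight": int(count)})
    return docs

def complete(docs, prefix, limit=5):
    """Case-insensitive prefix match, best-known (heaviest) entries first."""
    wanted = prefix.lower()
    hits = [d for d in docs if d["name"].lower().startswith(wanted)]
    hits.sort(key=lambda d: (-d["weight"], d["name"]))
    return [d["name"] for d in hits[:limit]]

# Hypothetical ranked output.
lines = [
    "Chinatown,_Example_City\t120",
    "Chinatown,_San_Francisco\t900",
    "Castro_District,_San_Francisco\t950",
]
docs = parse_ranked(lines)
print(complete(docs, "china"))
```

A linear scan like this is fine for a demo-sized list; a search index earns its keep once the list grows to hundreds of thousands of entries.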
86. All code for the workshop:
https://github.com/mattb/where2012-workshop