Data Driven Societies
Digital & Computational Studies
Bowdoin College
April 14, 2014
Professor Gieseking
Lecture Slides "Visualizing Social Life (When They Let You)"
A presentation prepared for KSFR, a public radio station in Santa Fe, New Mexico, USA. The main point is that the station should develop a "digital first" approach to all aspects pertaining to its Audience(s), Content and Technologies.
In democracies, demonstrating is a legitimate way for citizens to let their officials know how they feel about important topics and to try to change policies or attitudes. Peaceful demonstrations are a powerful means of maintaining checks and balances. As we have seen over the ages (going back to Roman times), once demonstrations turn into riots, democracies are shaken to the core. During a riot, the fine line between being an activist and a criminal is often crossed.
For law enforcement, restoring and keeping order is a challenge. It involves identifying the agitators: those who believe that the cause justifies violent means, and those who join demonstrations (often in other cities) to create trouble. Law enforcement needs the tools to identify and separate the bad apples from the rest in order to protect the fundamental democratic right to demonstrate.
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
What is big data, and what are its potential benefits and risks?
Presentation given by Sir Mark Walport at the Oxford Martin School on 3 December 2013.
Making Decisions in a World Awash in Data: We’re going to need a different bo... Micah Altman
In his abstract, Scriffignano summarizes as follows:
I'll explore some of the ways in which the massive availability of data is changing and the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: The new normal (how the data around us continues to change), how are we reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
Cylcia Bolibaugh spoke about reproducibility, open data and GDPR at the first Open Data in Practice event at the University of York on 15 November 2018.
Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial) Krishnaram Kenthapadi
Preserving privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed a renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and evolution of privacy techniques leading to differential privacy definition / techniques. Then, we will focus on the application of privacy-preserving data mining techniques in practice, by presenting case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary, and Microsoft's differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.
Managing and publishing sensitive data in the social sciences - Webinar trans... ARDC
Transcript of the 29th March ANDS webinar.
Slides and recording are available from the ANDS website: http://www.ands.org.au/news-and-events/presentations/2017
New Developments in Machine Learning - Prof. Dr. Max Welling Textkernel
Presentation from Prof. Dr. Max Welling, Professor of Machine Learning at the University of Amsterdam, at Textkernel's Intelligent Machines and the Future of Recruitment on June 2nd in Amsterdam.
At the end of this slide deck, you can also find the YouTube recording.
Due to increased compute power and large amounts of available data, machine learning is flourishing once again. In particular, deep learning is making great strides and maturing into a powerful technology. Max Welling briefly discusses variants of deep learning, such as convolutional neural networks and recurrent neural networks. But what lies around the corner in machine learning? He will discuss the three developments that in his opinion will become increasingly important:
1) Learning to interact with the world through reinforcement learning,
2) Learning while respecting everyone's privacy, and
3) Learning the causal relations in data (as opposed to discovering mere correlations).
Together, they represent the "power tools" of the future machine learner.
MIT Program on Information Science Talk -- Ophir Frieder on Searching in Hars... Micah Altman
Ophir Frieder, who holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing at Georgetown University, gave this talk on Searching in Harsh Environments as part of the Program on Information Science Brown Bag Series.
In the talk, illustrated by the slides below, Ophir rebuts the myth that "Google has solved search", and discusses the challenges of searching for complex objects, through hidden collections, and in harsh environments. For more see: http://informatics.mit.edu/blg
The Potential of Forensic Genetics in Resolving the Fate of the Missing
Thomas Parsons
Director of Forensic Science
International Commission on Missing Persons
The Building Blocks of QuestDB, a Time Series Database javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come to present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
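One of the ideas above, skipping in-identical vertices, can be sketched roughly as follows. This is a minimal illustration and not the STICD implementation: the function name `pagerank_in_identical`, its parameters, and the example graph are hypothetical, and it assumes a simple graph (no duplicate edges) with no dangling nodes. Vertices that share exactly the same set of in-links receive the same rank in every iteration, so the rank is computed once per group and copied to each member.

```python
from collections import defaultdict


def pagerank_in_identical(out_links, damping=0.85, tol=1e-10, max_iter=500):
    """PageRank that computes one value per group of in-identical vertices.

    out_links maps each vertex to the list of vertices it links to.
    Assumptions (hypothetical, for illustration): every vertex appears as a
    key, every vertex has at least one out-link, and edges are not repeated.
    """
    vertices = list(out_links)
    n = len(vertices)

    # Build the in-link set of every vertex.
    in_links = {v: set() for v in vertices}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].add(u)

    # Group vertices that have exactly the same in-links.
    groups = defaultdict(list)
    for v in vertices:
        groups[frozenset(in_links[v])].append(v)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        new_rank = {}
        for sources, members in groups.items():
            # One computation per group instead of one per vertex.
            contribution = sum(rank[u] / len(out_links[u]) for u in sources)
            value = (1.0 - damping) / n + damping * contribution
            for v in members:
                new_rank[v] = value
        error = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if error < tol:
            break
    return rank


# Hypothetical example: b, c and d are all linked only from a.
example_graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
example_ranks = pagerank_in_identical(example_graph)
```

In this example the three leaf vertices form one group, so each iteration evaluates the rank formula twice rather than four times. The saving grows with the number of duplicated in-link sets in the graph.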
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
1. Risks and mitigations of
releasing data
Risk analysis and complexity
in de-identifying and
releasing data.
Sara-Jayne Terp
RDF Discussion
2. First, Do No Harm
“If you make a dataset public, you
have a responsibility, to the best of
your knowledge, skills, and advice, to do
no harm to the people connected to that dataset. You
balance making data available
to people who can do good with
it and protecting the data
subjects, sources, and
managers.”
4. RISK
“The probability of something happening
multiplied by the resulting cost or benefit
if it does” (Oxford English Dictionary)
Three parts:
•Cost/benefit
•Probability
•Subject (to what/whom)
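The three parts of this definition can be written down as a small sketch. The class name `Risk`, the field names, and all of the numbers below are hypothetical and purely illustrative; they are not taken from the slides. The only thing carried over is the definition itself: probability multiplied by the resulting cost (or benefit), attached to a subject.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    subject: str        # to what/whom the risk applies
    probability: float  # chance of the event happening, between 0 and 1
    cost: float         # resulting cost if it does (a negative value is a benefit)

    def expected_cost(self) -> float:
        # "The probability of something happening multiplied by
        # the resulting cost or benefit if it does."
        return self.probability * self.cost


# Hypothetical, made-up values in arbitrary cost units.
risks = [
    Risk(subject="data collectors", probability=0.5, cost=20.0),
    Risk(subject="data subjects", probability=0.25, cost=100.0),
]

# Rank the risks so the largest expected cost is considered first.
ranked = sorted(risks, key=Risk.expected_cost, reverse=True)
```

Ranking by expected cost is only one possible design choice. A rare but catastrophic harm may deserve attention even when its expected cost is small.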
5. Subjects: Physical
“Witnesses told us that
a helicopter had been
circling around the
area for hours by the
time the bakery opened
in the afternoon. It
had, perhaps, 200
people lined up to get
bread. Suddenly, the
helicopter dropped a
bomb that hit a building
11. Risk to Whom?
• Data subjects (elections example)
• Data collectors (conflict example)
• Data processing team (military equipment example)
• Person releasing the data (corruption example)
• Person using the data
14. PII
“Personally identifiable information (PII) is any data that
could potentially identify a specific individual. Any
information that can be used to distinguish one
person from another and can be used for de-
anonymizing anonymous data can be
considered PII.”
15. Learn to spot Red Flags
• Names, addresses, phone numbers
• Locations: lat/long, GIS traces, locality (e.g. home +
work as an identifier)
• Members of small populations
• Untranslated text
• Codes (e.g. “41”)
• Slang terms
• Can be combined with other datasets to produce
PII
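A first automated pass over free-text fields can catch some of these red flags before a human review. The sketch below is an assumption-laden illustration: the pattern names, the regular expressions, and the function `find_red_flags` are hypothetical, and they cover only two items from the list (phone numbers and lat/long pairs). Names, slang, codes, and small populations need human and local judgement, so a clean result here does not mean the data is safe.

```python
import re

# Hypothetical, deliberately loose patterns; they will miss many real cases.
RED_FLAG_PATTERNS = {
    # A digit, then at least seven digits/separators, ending in a digit.
    "phone number": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    # Two decimal numbers with 3+ decimal places, separated by a comma.
    "lat/long pair": re.compile(r"-?\d{1,2}\.\d{3,}\s*,\s*-?\d{1,3}\.\d{3,}"),
}


def find_red_flags(text):
    """Return the sorted labels of every pattern found in the text."""
    return sorted(
        label
        for label, pattern in RED_FLAG_PATTERNS.items()
        if pattern.search(text)
    )


# Hypothetical example row from a dataset under review.
flags = find_red_flags("Call +1 505 555 0100, seen near 35.6870, -105.9378")
```

A scan like this is best used to route rows to a reviewer, not to decide on its own what is released.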
16. Consider Partial Release
Release to only some groups
• Academics
• People in your organisation
• Data subjects
Release at lower granularity
• Town/district level, not street
• Subset or sample of data ‘rows’
• Subset of data ‘columns’
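The lower-granularity options above can be sketched in a few lines. Everything named here is hypothetical: the functions `coarsen` and `partial_release`, the column names `lat`, `lon`, `name` and `report`, and the example records. Rounding coordinates to one decimal place (roughly 11 km of latitude) stands in for "town/district level, not street", and it is only an approximation of that idea, not a guarantee of anonymity.

```python
import random


def coarsen(record, keep_columns, decimals=1):
    """Keep only the chosen columns and round coordinates to lower precision."""
    out = {key: value for key, value in record.items() if key in keep_columns}
    for key in ("lat", "lon"):  # hypothetical coordinate column names
        if key in out:
            out[key] = round(out[key], decimals)
    return out


def partial_release(records, keep_columns, sample_fraction=0.5, seed=0):
    """Release a sample of the rows, a subset of the columns, coarser locations."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_size = min(len(records), max(1, int(len(records) * sample_fraction)))
    sampled = rng.sample(records, sample_size)
    return [coarsen(record, keep_columns) for record in sampled]


# Hypothetical, made-up records.
records = [
    {"name": "A", "lat": 35.6870, "lon": -105.9378, "report": "r1"},
    {"name": "B", "lat": 35.6912, "lon": -105.9444, "report": "r2"},
    {"name": "C", "lat": 35.7001, "lon": -105.9502, "report": "r3"},
    {"name": "D", "lat": 35.6655, "lon": -105.9610, "report": "r4"},
]
released = partial_release(records, keep_columns={"lat", "lon", "report"})
```

The three reductions are independent, so a release can apply any one of them alone. Whether the result is coarse enough still depends on the other datasets it could be combined with.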
17. Include locals
Locals can spot:
•Local languages
•Local slang
•Innocent-looking phrases
Locals might also choose the risk