Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.
The document discusses the Common Crawl project, which crawls and archives the web. It provides over 8 billion web pages and 120 TB of data that is freely available to anyone. The data includes raw HTML content, metadata, and text-only files. The document outlines some of the ways the Common Crawl data is currently being used, such as testing Apache Giraph, the MapLight political mapping project, image search by TinEye, and sentiment analysis projects. It also discusses future plans to expand the data available and its use cases.
Wikidata is a free, collaborative knowledge base maintained by the Wikimedia Foundation that aims to centralize structured data about items and provide machine-readable data to Wikimedia projects and third parties. It contains over 15 million items with 48 million statements and is currently in the process of adding statements about items after initially linking language editions of Wikipedia articles. Wikidata utilizes a variety of tools to visualize and query its structured data.
The document provides statistics on the FAST (Faceted Application of Subject Terminology) thesaurus as of June 2017, including the number of records and types of headings in FAST. It also lists links from FAST to other datasets like Library of Congress Subject Headings and Wikidata. The FAST team is continuing to synchronize and refine processing rules for FAST and developing an import tool. Information is also provided on the FACETVOC-L discussion list focused on faceted controlled vocabularies.
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze... (Data Con LA)
Data modeling can be a challenge for any transition to using Redis. Other databases rely on indexes and rich query languages to resolve limitations in your data modeling options, but this doesn't always work with Redis. I will discuss a data modeling technique that I use to solve my volunteer, personal, and professional data modeling challenges.
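The abstract does not spell out the speaker's technique, but the general pattern behind data modeling in Redis is maintaining your own secondary index, typically as a sorted set, alongside the primary records. The sketch below illustrates that pattern with a pure-Python stand-in for the relevant Redis commands (HSET, ZADD, ZRANGEBYSCORE), so it runs without a Redis server; all names are illustrative, not the talk's actual code.

```python
import bisect

# Pure-Python stand-in for a Redis hash store plus a sorted-set index.
hashes = {}   # "HSET user:<id> ..." -> one dict per user (primary record)
zset = []     # "ZADD users_by_age" -> (score, member) pairs, kept sorted

def save_user(user_id, name, age):
    hashes[f"user:{user_id}"] = {"name": name, "age": age}  # primary record
    bisect.insort(zset, (age, user_id))                     # secondary index

def users_with_age_between(lo, hi):
    # Equivalent of: ZRANGEBYSCORE users_by_age lo hi
    i = bisect.bisect_left(zset, (lo, ""))
    return [member for score, member in zset[i:] if score <= hi]

save_user("1", "ada", 36)
save_user("2", "bob", 24)
save_user("3", "eve", 29)
print(users_with_age_between(25, 40))  # ['3', '1']
```

Because Redis has no rich query language, the application owns the index: every write updates both the record and the sorted set, and range queries become index lookups instead of scans.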
The document discusses open government data and accessibility. It promotes using open data from sources like Data.gov, FCC.gov/Data and other open data portals. It also describes Yahoo! tools like YQL and the Yahoo! User Interface Library that can be used to access and mashup open government data through SQL-like queries or to build accessible user interfaces. Example queries and potential mashup ideas involving FCC and broadband data are provided.
Sometimes, a few entertaining social media jokes are just what the doctor ordered to brighten your day.
I’ve scoured the web to find a collection of 20 hilarious resources based on social media for you – each designed at the very least to put a smile on your face. Perhaps even force a chuckle or two. And very possibly, stir up a fit of giggles.
The document discusses memes as an empowering approach to social media: they let people step away from mainstream culture and be heard by creating something relatable that others can easily adapt and share collectively online. While some memes can be hurtful, others are created to inspire and bring about change. The document encourages creating empowering memes and provides links to meme generators for inspiration.
The document discusses the rise of internet memes and how they have changed culture. It explains that new technologies have made content creation, production, and broadcasting easy and accessible to everyone. Memes, which are ideas or behaviors spread through non-genetic means, have proliferated online in countless variations taking any possible digital form. Examples like "Not Bad Obama" demonstrate how memes can emerge from current events and spread virally with many derivatives in a short time, becoming a genuine part of mass culture.
A meme is an idea, concept, situation, or thought that can spread through links, videos, images, or other multimedia to entertain people and propagate virally on social networks. Memes have invaded the Internet in recent years and have become part of our popular culture. They arise from the memetic theory of cultural transmission proposed by Richard Dawkins, in which people pass on social and cultural memories by imitating one another.
The document discusses internet memes, defining them as ideas or units of cultural information that spread rapidly online through sharing. It notes that memes began as edited videos with voiceovers or text that people found contagious and enjoyed spreading through word of mouth. Specifically, the document examines the "Kersal massive" meme, a viral video of three "chavs" rapping about their hometown for a BMX competition, which became popular due to many remixes and crossed over into mainstream media by being featured on a television program.
MEMS (micro-electro-mechanical systems) combine mechanical and electrical functions on a single chip using microfabrication technology. The fabrication process for MEMS is similar to that used for making electronic circuits and involves steps such as chemical deposition, physical deposition, lithography, and etching. MEMS can be used to develop microsensors using materials like metals, polymers, ceramics, semiconductors, and composites. Common applications of MEMS include accelerometers, which have advantages over conventional accelerometers such as lower cost, smaller size, and lower power requirements.
2016 was a ‘meme-ntous’ year. Memes saw people around the world pretend to be mannequins, influenced the US presidential election, and nearly led the UK government to name a ship “Boaty McBoatface”.
Memes are nothing new: they have been a staple of culture and communications for thousands of years. What is new is the speed with which memes are created, adapted, and spread around the world via social media.
Today, Internet memes are being used to great effect by brands, third-sector organisations and political movements (from the “alt-right” to their far-left alternatives). Opportunities abound for entities who use them well. If you work in communications you need to understand where Internet memes come from, how they work, and how you can use them. This report answers those questions. Enjoy it and get in touch with queries.
Data-driven analysis of how memes (image macros) are used and how the emergent properties of memes are established. Product of active research at: http://whichmeme.info
The document discusses social networking sites and provides statistics about key players and markets in 2010. It summarizes user numbers, revenues, and rankings of top social networking sites like Facebook, Twitter, Myspace, and LinkedIn. It also provides data on the top social networking markets and sites in India and average time spent on different Indian sites. Finally, it discusses revenue models, an external environment analysis, factors for success, and analyzing competitiveness of social media companies.
A Complete Guide To The Best Times To Post On Social Media (And More!) (TrackMaven)
Do you know the most effective times to post on social media, send an email, or publish a blog? We've broken down the data behind the most effective times to post content on Twitter, Instagram, Facebook, Content Marketing, and Email.
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ... (Dave McClure)
This document provides tips on using emojis, GIFs, and memes to influence people and increase company valuation. It recommends including emojis, GIFs, and memes in communications as quick, easy, universal, and fun ways to engage others. However, it warns not to get arrested by being inappropriate, and it provides examples of emojis, GIFs, and memes to consider using. It also thanks various internet personalities and celebrities who helped inspire the use of digital media in business communications.
Discovering and Navigating Memes in Social Media (Matthew Lease)
Invited talk at SBP 2012: Intl. Conf. on Social Computing, Behavioral-Cultural Modeling, & Prediction (April 3, 2012). Based on paper by Ryu, Lease, and Woodward, to appear at ACM HyperText 2012. Joint work with Hohyon Ryu and Nicholas Woodward.
E Science As A Lens On The World - Lazowska (guest43b4df3)
The document summarizes a presentation about eScience and its implications. It discusses how eScience is driven by massive amounts of sensor data and requires analysis of large datasets. It also describes how technologies like cloud computing, databases, data mining and machine learning enable eScience. Finally, it argues that eScience capabilities will be essential for any organization to remain competitive in the future.
The document summarizes a presentation about eScience and its implications. It discusses how eScience is driven by massive amounts of sensor data and requires analysis of large datasets using technologies like databases, data mining, machine learning and data visualization on cluster computing systems at enormous scales. It states that eScience capabilities will be required for organizations to remain competitive in the future. It also discusses how technologies like Amazon EC2 enable scalable computing resources for any organization and how broadband access and networks like Internet2 played an important role in enabling eScience.
MapReduce and Hadoop provide a framework for processing vast amounts of data across clusters of computers. It allows distributed processing of large datasets in a parallel and fault-tolerant manner. The key components are HDFS for storage, and MapReduce for distributed processing. HDFS stores data reliably across commodity hardware, while MapReduce breaks jobs into map and reduce tasks that can run in parallel across a cluster.
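The map/reduce split described above is easiest to see in the canonical word-count example. This single-process Python sketch shows the three phases — map, shuffle, reduce — that the framework normally distributes across a cluster, with HDFS holding the input splits.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (key, value) pairs: one ("word", 1) per token.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key -- the framework does this between map and reduce,
    # routing all pairs for a key to the same reduce task.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Fault tolerance falls out of the same structure: a failed map or reduce task is simply re-run on another node, since each task is a pure function of its input split.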
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017 (Noemi Derzsy)
This document summarizes Noemi Derzsy's presentation at PyData New York 2017 about analyzing open NASA datasets. It notes that NASA currently has over 32,000 open datasets across various topics. Derzsy demonstrates how to analyze the datasets using natural language processing and network analysis techniques in Python like word clouds, topic modeling with LDA, text classification, and network analysis to discover relationships between keywords, descriptions and titles. The goal is to better understand and organize the large number of datasets.
Data Science Keys to Open Up OpenNASA Datasets (PyData)
By Noemi Derzsy
PyData New York City 2017
Open source data has enabled society to engage in community-based research, and has provided government agencies with more visibility and trust from individuals. I will briefly introduce the openNASA platform with over 32,000 open NASA datasets, and I will present open NASA metadata analysis, and tools for applying NLP/topic modeling techniques to understand open government dataset associations.
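The talk applies NLP and network analysis to dataset metadata; a minimal stand-in for that idea is below. The toy records are invented for illustration (the real openNASA catalog exposes similar fields per dataset: title, description, keywords), and the sketch shows the two ingredients the abstract mentions: keyword frequencies (the raw material for a word cloud) and association edges between datasets that share keywords.

```python
from collections import Counter
from itertools import combinations

# Invented metadata records standing in for openNASA catalog entries.
records = [
    {"title": "A", "keywords": {"mars", "atmosphere", "dust"}},
    {"title": "B", "keywords": {"mars", "rover", "imagery"}},
    {"title": "C", "keywords": {"earth", "atmosphere", "climate"}},
]

# Keyword frequencies across the catalog.
freq = Counter(kw for r in records for kw in r["keywords"])

# Association network: link two datasets when their keyword sets overlap.
edges = {
    (a["title"], b["title"]): a["keywords"] & b["keywords"]
    for a, b in combinations(records, 2)
    if a["keywords"] & b["keywords"]
}
print(freq["mars"], sorted(edges))  # 2 [('A', 'B'), ('A', 'C')]
```

On the real catalog the same edge construction yields a graph over tens of thousands of datasets, which is where community detection and topic modeling become useful for organizing it.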
Realtime Indexing for Fast Queries on Massive Semi-Structured Data (ScyllaDB)
Rockset is a realtime indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset are automatically indexed and a fully featured SQL engine powers fast queries over semi-structured data without requiring any database tuning. Rockset exploits the hardware fluidity available in the cloud and automatically grows and shrinks the cluster footprint based on demand. Available as a serverless cloud service, Rockset is used by developers to build data-driven applications and microservices.
In this talk, we discuss some of the key design aspects of Rockset, such as Smart Schema and Converged Index. We describe Rockset's Aggregator Leaf Tailer (ALT) architecture, which provides low-latency queries on large datasets. Then we describe how you can combine lightweight transactions in ScyllaDB with realtime analytics on Rockset to power a user-facing application.
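The "index every field without schematization" idea can be sketched in a few lines: flatten each schemaless JSON-like document into (field path, value) pairs and map each pair back to the documents containing it. This shows only the inverted-index slice of the design; Rockset's actual Converged Index also maintains row and columnar representations, none of which is reproduced here, and the document shapes below are invented.

```python
from collections import defaultdict

inverted = defaultdict(set)   # (path, value) -> {doc ids}

def flatten(doc, prefix=""):
    # Walk nested dicts, yielding dotted field paths and leaf values.
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

def ingest(doc_id, doc):
    # Every field of every document is indexed -- no schema declared up front.
    for path, value in flatten(doc):
        inverted[(path, value)].add(doc_id)

ingest(1, {"user": {"city": "Austin"}, "score": 10})
ingest(2, {"user": {"city": "Oslo"}, "score": 10})

# "SELECT id WHERE user.city = 'Austin'" resolves to one index lookup.
print(inverted[("user.city", "Austin")], inverted[("score", 10)])  # {1} {1, 2}
```

Because documents with different shapes simply contribute different paths, new fields become queryable the moment they are ingested.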
Startup Bootcamp - Intro to NoSQL/Big Data by DataZone (Idan Tohami)
Arthur Gimpel gave an introduction to Big Data and NoSQL databases. He defined the 3 V's of Big Data as volume, velocity, and variety. He explained different NoSQL database types, including key-value, graph, and document databases. Key-value databases provide fast access by key, but querying values is challenging. Graph databases are useful for relationships between entities. Document databases commonly store JSON and include popular examples like MongoDB. Choosing the right database is important, and replacing one later is not cheap.
This document provides a comparison of SQL and NoSQL databases. It summarizes the key features of SQL databases, including their use of schemas, SQL query languages, ACID transactions, and examples like MySQL and Oracle. It also summarizes features of NoSQL databases, including their large data volumes, scalability, lack of schemas, eventual consistency, and examples like MongoDB, Cassandra, and HBase. The document aims to compare the different approaches of SQL and NoSQL for managing data.
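The contrast the document draws can be shown in miniature with the standard library: a declared schema plus SQL on one side, and a schemaless key-value store on the other. The dict below is a deliberately crude stand-in for a NoSQL store; real systems add secondary indexes or denormalize precisely because querying by value otherwise means scanning every record.

```python
import json
import sqlite3

# SQL side: a declared schema and a query language.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
db.execute("INSERT INTO users VALUES (1, 'ada', 36), (2, 'bob', 24)")
rows = db.execute("SELECT name FROM users WHERE age > 30").fetchall()

# NoSQL side in miniature: schemaless key-value. Reads by key are trivial;
# records may differ in shape; queries on the *value* require a full scan.
kv = {
    "user:1": json.dumps({"name": "ada", "age": 36, "tags": ["admin"]}),
    "user:2": json.dumps({"name": "bob", "age": 24}),  # different shape: fine
}
over_30 = [json.loads(v)["name"] for v in kv.values() if json.loads(v)["age"] > 30]
print(rows, over_30)  # [('ada',)] ['ada']
```

The schemaless record ("user:1" carries a `tags` field the other lacks) would require a migration in the SQL table; the scan in the last line is the price paid for that flexibility.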
This document outlines a proposed approach to use distributed data mining techniques to help users make sense of large amounts of content in online collaborative spaces. It discusses how "big data" is affecting users' ability to understand discussions. The approach involves preprocessing content, clustering it using Hadoop and Mahout, and generating topic clouds. A case study clusters content from technical forums and finds topic-specific discussions not obvious from category names. The conclusion is that distributed data mining can help summarize huge online discussions and uncover buried topics to support user sensemaking.
On Friday, September 25th, Devin Hopps led us through a presentation on an introduction to Big Data and how technology has evolved to harness the power of Big Data.
The document discusses scaling web data at low cost. It begins by presenting Javier D. Fernández and providing context about his work in semantic web, open data, big data management, and databases. It then discusses techniques for compressing and querying large RDF datasets at low cost using binary RDF formats like HDT. Examples of applications using these techniques include compressing and sharing datasets, fast SPARQL querying, and embedding systems. It also discusses efforts to enable web-scale querying through projects like LOD-a-lot that integrate billions of triples for federated querying.
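One core compression idea behind binary RDF formats like HDT is dictionary encoding: long, repetitive IRIs are replaced by small integer IDs, and the triples themselves become compact ID tuples. The toy sketch below shows only that step; real HDT additionally bit-packs the ID streams and stores the triples component as a tree, none of which is reproduced here. The example IRIs are illustrative.

```python
# Example triples with deliberately repetitive IRIs.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]

dictionary = {}                      # term -> integer ID
def term_id(term):
    # Assign IDs on first sight; repeated terms reuse their ID.
    return dictionary.setdefault(term, len(dictionary) + 1)

encoded = [tuple(term_id(t) for t in triple) for triple in triples]
lookup = {v: k for k, v in dictionary.items()}   # ID -> term, for decoding

print(encoded[0])                    # (1, 2, 3)
print(lookup[encoded[1][2]])         # Alice
```

With six distinct terms across nine triple positions, the encoded form already stores each IRI once; at the scale of billions of triples (as in LOD-a-lot), this is what makes keeping the whole dataset queryable on modest hardware plausible.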
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics (plan4all)
This document introduces linked data and the semantic web. It defines linked data as using URIs to identify things on the web and describe them using standard formats like RDF to link related things. This allows data on the web to be treated like a large database. The semantic web builds on linked data principles to publish structured data on the web that can be processed by machines, helping make information more discoverable and science more reproducible. Challenges include agreeing on definitions, performance of query languages, and the effort required to publish high-quality linked data.
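The "data as a large database" claim above comes down to two things: facts are (subject, predicate, object) triples whose subjects and predicates are URIs, and querying means matching patterns against the triple set — the idea behind SPARQL basic graph patterns. The sketch below uses `None` as a wildcard in place of a SPARQL variable; the URIs are invented examples.

```python
EX = "http://example.org/"

# A tiny RDF-style graph as a set of (subject, predicate, object) triples.
triples = {
    (EX + "paper1", EX + "author", EX + "alice"),
    (EX + "paper1", EX + "year", "2012"),
    (EX + "paper2", EX + "author", EX + "alice"),
    (EX + "paper2", EX + "author", EX + "bob"),
}

def match(s=None, p=None, o=None):
    # None acts as a wildcard, like a variable in a SPARQL pattern.
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# "Which papers does alice author?"
papers = [t[0] for t in match(p=EX + "author", o=EX + "alice")]
print(papers)  # ['http://example.org/paper1', 'http://example.org/paper2']
```

Because every dataset published this way uses the same triple shape, two independently published graphs can be merged by set union and queried together — which is the linking in "linked data".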
The Hadoop Distributed File System (HDFS) has evolved from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure that holds all of an organization's data. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling HDFS storage management: the centralized scheme within the NameNode becomes the main bottleneck, limiting the total number of files stored. Although a typical large HDFS cluster can store several hundred petabytes of data, the current architecture handles large numbers of small files inefficiently.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. Storage management is enhanced to a distributed scheme: a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
This document provides an introduction to big data and NoSQL databases. It begins with an introduction of the presenter. It then discusses how the era of big data came to be due to limitations of traditional relational databases and scaling approaches. The document introduces different NoSQL data models including document, key-value, graph and column-oriented databases. It provides examples of NoSQL databases that use each data model. The document discusses how NoSQL databases are better suited than relational databases for big data problems and provides a real-world example of Twitter's use of FlockDB. It concludes by discussing approaches for working with big data using MapReduce and provides examples of using MongoDB and Azure for big data.
RDF has several advantages as a data modeling technique including its ability to represent complex relationships as graphs and enable novel inferences. However, graphs also introduce complexity in querying, storage, and visualization. Additional challenges include the open world assumption, managing ontologies and identities, multiple serialization formats, and addressing temporal issues as data changes over time. Overall, proponents argue that the benefits outweigh the limitations, which can be mitigated through tools and standards, and RDF enables a more powerful representation of data on the web compared to alternatives.
NoSQL databases like MongoDB, Elasticsearch, and Cassandra are synonymous with scalability, search, and developer agility. But there’s a downside...having to give up the ease and comfort of SQL.
Or do you?
Join this webcast to learn how the newest databases, like CrateDB and CockroachDB, deliver the benefits of NoSQL with the ease of SQL by building SQL engines on top of custom NoSQL technology stacks. Database industry veteran Andy Ellicott, who helped launch Vertica, VoltDB, Cloudant, and now with Crate.io, will provide a no-BS view of current DBMS architectures and predictions for the future of data.
If you’re a DBMS user, this webcast will help you make sense of a very crowded DBMS market and make better-informed decisions for your new tech stacks.
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
Context is everything, from the clothing you choose in the morning to the dinner menu you plan based on available ingredients and time. The word on the street is that DITA maps are the express context designed to drive builds for particular deliverables and conditionality for DITA topics. That is partly true, but it is not the whole story.
For one thing, maps are far more versatile than just as build directives. Moreover, DITA topic processing can get its cues from contexts other than maps. And therein hangs the premise of Going Mapless.
To get our own context for this presentation, we start with a quick review of the original architectural definition of DITA and then trace the popular information architectures and tools that have grown up with the standard as we currently know it. Then Don introduces some scenarios where DITA could be useful if freed from the prevailing map-driven processing paradigm, and he walks you through some available methods and solutions for using DITA in these unconventional ways.
This presentation was given at Information Development World on October 2, 2015.
This document summarizes work being done to express the Data Documentation Initiative (DDI) metadata standard in Resource Description Framework (RDF) format to improve discovery and linking of microdata on the Web of Linked Data. It describes background on the DDI to RDF mapping effort, the goals of making microdata more accessible and interoperable online, and examples of how the RDF representation would support common discovery use cases. It also provides information on tools and next steps for the ongoing work, acknowledging contributions from participants in workshops where this effort was discussed.
Automated Models for Quantifying Centrality of Survey Responses Matthew Lease
Research talk presented at "Innovations in Online Research" (October 1, 2021)
Event URL: https://web.cvent.com/event/d063e447-1f16-4f70-a375-5d6978b3feea/websitePage:b8d4ce12-3d02-4d24-897d-fd469ca4808a.
Explainable Fact Checking with Humans in-the-loop Matthew Lease
Invited Keynote at KDD 2021 TrueFact Workshop: Making a Credible Web for Tomorrow, August 15, 2021.
https://www.microsoft.com/en-us/research/event/kdd-2021-truefact-workshop-making-a-credible-web-for-tomorrow/#!program-schedule
Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807-1818, 2020.
AI & Work, with Transparency & the Crowd Matthew Lease
The document discusses designing human-AI partnerships and the role of crowdsourcing in AI systems. It summarizes work on designing AI assistants to work with humans, using crowds to help fact-check information, and explores challenges around protecting crowd workers who review harmful content or do "dirty jobs". It advocates for more research on ethics in AI and using crowds to help check work for ethical issues.
Designing Human-AI Partnerships to Combat Misinformation Matthew Lease
The document discusses designing human-AI partnerships to combat misinformation. It describes a prototype partnership where a human and AI work together to fact-check claims. The partnership aims to make the AI more transparent and address user bias by allowing the user to adjust the perceived reliability of news sources, which then changes the AI's political leaning analysis and fact checking results. The discussion wraps up by noting challenges like avoiding echo chambers and assessing potential harms, as well as opportunities for AI to reduce bias and increase trust through explainable, interactive systems.
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno... Matthew Lease
This document summarizes a presentation about designing human-AI partnerships for fact-checking misinformation. It discusses using crowdsourced rationales to improve the accuracy and cost-efficiency of annotation tasks. It also addresses challenges in designing interfaces for automatic fact-checking models, such as integrating human knowledge and reasoning to correct errors and account for bias. The goal is to develop mixed-initiative systems where humans and AI can jointly reason and personalize fact-checking.
Presentation given at the Linguistic Data Consortium (LDC), University of Pennsylvania, April 2019. Based on presentations at the 6th ACM Collective Intelligence Conference, 2018 and the 6th AAAI Conference on Human Computation & Crowdsourcing (HCOMP), 2018. Blog post: https://blog.humancomputation.com/?p=9932.
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact... Matthew Lease
Presented at the 31st ACM User Interface Software and Technology Symposium (UIST), 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/nguyen-uist18.pdf
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio... Matthew Lease
Presentation at the 1st Biannual Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). August 30, 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/kutlu-desires18.pdf
Talk given August 29, 2018 at the 1st Biannual Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). Paper: https://www.ischool.utexas.edu/~ml/papers/lease-desires18.pdf
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E... Matthew Lease
Presentation at the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), July 7, 2018. Work by Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. Pages 41-49 in conference proceedings. Online version of paper includes corrections to official version in proceedings: https://www.ischool.utexas.edu/~ml/papers/goyal-hcomp18
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for... Matthew Lease
Invited Talk at the ACM JCDL 2018 WORKSHOP ON CYBERINFRASTRUCTURE AND MACHINE LEARNING FOR DIGITAL LIBRARIES AND ARCHIVES. https://www.tacc.utexas.edu/conference/jcdl18
Deep Learning for Information Retrieval: Models, Progress, & Opportunities Matthew Lease
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Systematic Review is e-Discovery in Doctor’s Clothing Matthew Lease
This document discusses opportunities for collaboration between researchers working in systematic reviews and electronic discovery (e-discovery). It notes similarities in the challenges both fields face, including the need for high recall with bounded costs and reliance on multi-stage review pipelines. The document proposes that technologies developed for semi-automated citation screening and crowdsourcing could help address current limitations. It concludes by encouraging information retrieval researchers to investigate open problems in systematic reviews as opportunities to advance technologies beyond other tasks and help bring together interested parties through forums like the TREC Total Recall track.
Crowd computing utilizes both crowdsourcing and human computation to solve problems. Crowdsourcing enables more efficient and scalable data collection and processing by outsourcing tasks to a large, undefined group of people. Human computation allows software developers to incorporate human intelligence and judgment into applications to provide capabilities beyond current artificial intelligence. Examples discussed include Amazon Mechanical Turk, various crowd-powered applications, and how crowdsourcing has helped label large datasets to train machine learning models.
The Rise of Crowd Computing (December 2015) Matthew Lease
Crowd computing is rising with two waves - the first using crowds to label large amounts of data for artificial intelligence applications. The second wave delivers applications that go beyond AI abilities by incorporating human computation. Open problems remain around ensuring high quality outputs, task design, understanding the worker context and experience, and addressing ethics concerns around opaque platforms and working conditions. The future holds potential for empowering crowd work but also risks like digital sweatshops if worker freedoms and conditions are not considered.
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms Matthew Lease
The document summarizes a presentation about analyzing paid crowd work platforms beyond Mechanical Turk. It discusses how Mechanical Turk has dominated research on paid crowdsourcing due to its early popularity, but that it has limitations. The presentation conducts a qualitative study of 7 alternative crowd work platforms to identify distinguishing capabilities not found on MTurk, such as different payment models, richer worker profiles, and support for confidential tasks. It aims to increase awareness of other platforms to further inform practice and research on crowdsourcing.
HCL Notes and Domino License Cost Reduction in the World of DLAU panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to put into action immediately
UiPath Test Automation using UiPath Test Suite series, part 6 DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
This webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will learn how generative AI, integrated with OpenAI's advanced natural language processing capabilities, can be applied to test automation.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, improve testing accuracy, and speed up the software testing life cycle. Topics covered include the integration process, practical use cases, and the benefits of AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals will gain valuable insights into harnessing AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
OpenID AuthZEN Interop Read Out - Authorization David Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
GraphRAG for Life Science to increase LLM accuracy Tomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Driving Business Innovation: Latest Generative AI Advancements & Success Story Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence IndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Project Management Semester Long Project - Acuity jpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
TrustArc Webinar - 2024 Global Privacy Survey TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Programming Foundation Models with DSPy - Meetup Slides Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Introduction of Cybersecurity with OSS at Code Europe 2024 Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Presentation of the OECD Artificial Intelligence Review of Germany
Discovering Memes in Social Media
1. Discovering Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with
Hohyon Ryu & Nicholas Woodward
Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
2. Memes
• Short, similar phrases found in
many different sources
– Re-use, shared temporal context
• Evolutionary mutation &
propagation as they transmit
from source-to-source
• Reveals implicit connections
between sources, individuals
and communities involved
March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2
4. Google/NYT Living Stories
livingstories.googlelabs.com
5. Related Work
• Jure Leskovec et al. (KDD’09): blogs
– quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
– Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
– Mine “popular passages” from complete texts
– MapReduce “shingling” approach
– Popular passages found are local, not global
6. MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
– 48 Dell R610 nodes
• 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
• 48GB RAM with ~1.5TB disk per node
• With 1 NameNode & 47 Datanodes, up to 376 parallel Mappers
– 16 Dell R710 (same CPU configuration)
• 144GB RAM with ~0.8TB disk per node
– Setup Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
7. Datasets
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset
8. Processing Architecture
• Blogs08 Test Collection: 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce): Decruft & Language Identification; HTML Strip & Near-Duplicate Detection → 16M posts, 960GB
• Common Phrase Extraction (3 MapReduce stages): 15K posts, 43GB
• Common Phrase Ranking (1 MapReduce process): Daily Top 200 Phrases → 6.2M phrases, 2GB
• Common Phrase Clustering (1 MapReduce process): 75K phrases, 2.6MB
• Meme Browser: 68K memes
9. Creating the Shingle Table
• e.g. trigram shingles for: what do you think of
– what do you
– do you think
– you think of
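As a concrete illustration of shingling (a minimal sketch, not the paper's actual Hadoop implementation), trigram shingles can be generated in a few lines of Python:

```python
def shingles(text, n=3):
    """Slide an n-token window across the text, yielding each n-gram."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles("what do you think of"))
# ['what do you', 'do you think', 'you think of']
```

In the full shingle table, each such n-gram would be keyed to the documents and token positions where it occurs, so that shingles shared across documents can be looked up efficiently.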
11. Common Phrase (CP) Detection
• Mapper:
Merge adjacent
shingles into memes
(ignoring small gaps)
• Reducer:
Find set of
documents in which
each meme occurs
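The mapper and reducer above can be sketched as follows. This is a simplified, hypothetical Python illustration of the logic only; the actual system runs as a Hadoop MapReduce job, and the gap-bridging threshold here is an assumption:

```python
from collections import defaultdict

def merge_shingles(tokens, hits, n=3, max_gap=1):
    """Mapper sketch: merge adjacent shingle hits (token offsets where a
    shared n-gram starts) into longer candidate phrases, bridging small gaps."""
    phrases = []
    start = prev = None
    for pos in sorted(hits):
        if start is None:
            start = prev = pos
        elif pos - prev <= n + max_gap:  # windows touch or nearly touch
            prev = pos
        else:
            phrases.append(" ".join(tokens[start:prev + n]))
            start = prev = pos
    if start is not None:
        phrases.append(" ".join(tokens[start:prev + n]))
    return phrases

def phrase_documents(mapped_pairs):
    """Reducer sketch: for each phrase, collect the set of documents
    in which it occurs."""
    docs = defaultdict(set)
    for phrase, doc_id in mapped_pairs:
        docs[phrase].add(doc_id)
    return dict(docs)
```

For example, shingle hits at offsets 0, 1, and 2 in "what do you think of this" merge into the single phrase "what do you think of"; the reducer then inverts (phrase, doc) pairs into a phrase-to-document-set index.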
18. Thank You!
• Joint work with:
– Hohyon (Will) Ryu: InfoChimps (Summer ’11), Indeed.com (Summer ’12)
– Nicholas Woodward: TACC, Latin American Network Information Center (LANIC)
• Support:
– FCT of Portugal / UT CoLab
– Amazon Web Services
– UT Austin LIFT Award
– John P. Commons Fellowship
Matt Lease • ml@ischool.utexas.edu • www.ischool.utexas.edu/~ml • @mattlease