This deck describes a project to identify and extract images from a collection of 19th-century scanned books. As a test case, the team applied computer vision (Haar cascades trained on photographs) to find faces in the books, and found that female faces were more likely to be detected than male ones. About 580 GB of JPGs were extracted and uploaded to Flickr Commons, where they received 55 million image views in 5 days. Workers and message queues distributed the image processing and uploading tasks. Ongoing monitoring tracks changes to the images on Flickr.
3. It began with dogfood...
• "Given access to a filesystem of media
with an easily learned layout
convention, can a researcher use their
own tools?"
• So we contrived a research question:
4. "Can we find the faces in the
19th C scanned book
collection?"
5.
6. Outcome:
• Most tools and libraries expect local
filesystem or in-memory access, so the
researcher needs no network/API
knowledge.
• While lookup by layout is awkward, it is a
pragmatic approach when distributing
content by sneakernet. It could be paired
with a lightweight online search engine and
a documentation wiki for best practices.
8. 'Project' success?
• Computer Vision algorithms are
predominantly based on photographic
input. Room for improvement.
• Catch-22 with respect to training sets.
• But... applying Haar cascade profiles,
based on a photo training set, had some
reasonable success!
10. 19C depictions of faces
• Likelihood of detection:
• Female faces > Male
• Why women?
• Drawn more symmetrically - male faces were
more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - very
different to the training sets
12. An Interesting By-product emerged
• The ALTO XML, created by Microsoft as part of
the digitisation process, was found to have
'GraphicalIllustration' elements.
– Polygonal boundaries for areas where the
OCR software detected contiguous content
but could not recognise any text.
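As a rough illustration of how those elements could be turned into a list of image regions, here is a minimal sketch using only the Python standard library. The element names it matches, the `HPOS`/`VPOS`/`WIDTH`/`HEIGHT` attributes and the sample document are assumptions for illustration, not taken from the project's files. It reads bounding boxes only and ignores any polygon outline, and the units of the coordinates depend on the ALTO file's declared measurement unit.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def illustration_boxes(alto_xml):
    """Return (id, hpos, vpos, width, height) for each illustration block."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        # Assumed element names: the slides mention both spellings.
        if local_name(elem.tag) not in ("Illustration", "GraphicalIllustration"):
            continue
        boxes.append((
            elem.get("ID"),
            float(elem.get("HPOS", 0)),
            float(elem.get("VPOS", 0)),
            float(elem.get("WIDTH", 0)),
            float(elem.get("HEIGHT", 0)),
        ))
    return boxes


# Hypothetical, heavily trimmed ALTO page used only to exercise the function.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="T1" HPOS="10" VPOS="10" WIDTH="500" HEIGHT="80"/>
    <Illustration ID="I1" HPOS="120" VPOS="300" WIDTH="640" HEIGHT="480"/>
  </PrintSpace></Page></Layout>
</alto>"""

boxes = illustration_boxes(SAMPLE)
print(boxes)
```

Matching on the local tag name keeps the sketch independent of which ALTO namespace version a given file declares.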
14. A map to all* the images?
The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations
* Unlikely to be comprehensive
15.
16.
17. Microsoft Books
• Context:
– 47k 'works' digitised, 68k volumes
– 15.3 TB images, 1.3 TB ALTO XML
– circa 22 million+ JPEG 2000 images, 150-200 DPI
(unconfirmed), one zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on
travel, geography, ethnology, (English)
literature and plenty of 'misc'
18. Accessible?
• In theory, the books were accessible
online.
• In practice, it was a real challenge to find
anything viewable.
19. Image extraction process
• Worker-based, using a message queue to
coordinate.
• Thread-unsafe (due to zips) so limited to
one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get
RAM upgrade)
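The "one worker per zip" constraint can be sketched in-process with the standard library. This is an assumption-laden stand-in, not the project's code: it uses `queue.Queue` and threads where the project used a message queue, and `run_workers`, `extract` and the job shape are hypothetical names. Grouping jobs by zip before queueing means each zip is only ever opened by a single worker.

```python
import queue
import threading
from collections import defaultdict


def run_workers(jobs, extract, n_workers=4):
    """Process (zip_path, region) jobs so that each zip has exactly one worker.

    `extract(zip_path, regions)` is a caller-supplied function (hypothetical).
    """
    # Group regions by zip so a zip is never shared between workers.
    by_zip = defaultdict(list)
    for zip_path, region in jobs:
        by_zip[zip_path].append(region)

    work = queue.Queue()
    for item in by_zip.items():
        work.put(item)

    def worker():
        while True:
            try:
                zip_path, regions = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            extract(zip_path, regions)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


# Demo with a fake extractor that just records what it was given.
seen = []
seen_lock = threading.Lock()


def fake_extract(zip_path, regions):
    with seen_lock:
        seen.append((zip_path, sorted(regions)))


run_workers([("a.zip", 1), ("b.zip", 1), ("a.zip", 2)], fake_extract, n_workers=3)
print(sorted(seen))
```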
20. Tech used:
• VirtualBox
• Redis (msg queue, semaphore, metadata
cache)
• Python
– OpenCV main library used:
• Opens JPEG 2000 with colour profiles
• Quick to work with image regions
• Also saved region as JPG (92% quality) for reuse
21. Filter first!
• Only ALTO files with an Illustration
element are of concern.
• Grep quickly identified the 1 million XML
files of interest (only 4-5% of the total)
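The same pre-filter can be sketched in Python for cases where grep is not to hand. The marker bytes, the `*.xml` pattern and the function name are assumptions for illustration. It is a plain substring test on raw bytes, so like grep it avoids parsing the XML at all.

```python
import os
import tempfile
from pathlib import Path


def files_with_illustrations(root, marker=b"<Illustration"):
    """Return the XML files under `root` whose raw bytes contain `marker`."""
    hits = []
    for path in sorted(Path(root).rglob("*.xml")):
        if marker in path.read_bytes():
            hits.append(path)
    return hits


# Demo on two throwaway files: only one mentions an illustration.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page_001.xml").write_bytes(b"<alto><Illustration ID='I1'/></alto>")
    Path(tmp, "page_002.xml").write_bytes(b"<alto><TextBlock ID='T1'/></alto>")
    hit_names = [p.name for p in files_with_illustrations(tmp)]

print(hit_names)
```

A namespace-prefixed tag would not match this marker, so the marker would need adjusting to whatever the files actually contain.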
25. Resilience
• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disc? Is it
non-zero in length?
– Did IT services hard-reboot the desktop
machine hosting your VMs overnight?
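The "does the expected JPG exist, and is it non-zero in length?" check can be sketched as below. The function names and the idea of returning a requeue list are assumptions, and the file contents in the demo are placeholders rather than real images.

```python
import os
import tempfile


def output_ok(path):
    """True only if the file exists and is not empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:  # missing file, unreadable path, etc.
        return False


def failed_jobs(expected_paths):
    """Return the expected outputs that are missing or zero-length."""
    return [path for path in expected_paths if not output_ok(path)]


# Demo: one good file, one empty file, one that was never written.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    empty = os.path.join(tmp, "empty.jpg")
    missing = os.path.join(tmp, "missing.jpg")
    with open(good, "wb") as handle:
        handle.write(b"\xff\xd8\xff")  # placeholder bytes, not a full JPEG
    open(empty, "wb").close()
    requeue = [os.path.basename(p) for p in failed_jobs([good, empty, missing])]

print(requeue)
```

Checking the output on disc, not the worker's exit status, also catches the silent failures the slide warns about.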
26. Overview:
• Started with one desktop VM, and a
connection to a local NAS
• Ended up also using multiple VMs on Azure,
after piping content to Azure storage.
– Redis replicated natively, with an SSH tunnel
to the write node
27. Identifiers...
• Little help available from overstretched IT
architecture team.
• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
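A minimal sketch of building and parsing names in that pattern follows. The field widths and the example values are hypothetical, since the slide gives only the field order. Splitting on the first four underscores lets the trailing human-readable text contain underscores of its own.

```python
from typing import NamedTuple


class ImageName(NamedTuple):
    sysnum: str       # catalogue system number
    volume: str
    page: str
    image_index: str  # index of the image on the page
    human_text: str   # free-text suffix for human readers


def build_name(name):
    """Join the five fields into SYSNUM_VOL_PG_IMGIDX_humantxt.jpg."""
    return "_".join(name) + ".jpg"


def parse_name(filename):
    """Split a filename back into its five fields."""
    if not filename.endswith(".jpg"):
        raise ValueError(f"not a .jpg name: {filename!r}")
    parts = filename[: -len(".jpg")].split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields in {filename!r}")
    return ImageName(*parts)


# Hypothetical example values.
example = ImageName("000123456", "01", "000045", "1", "a_view_of_the_harbour")
filename = build_name(example)
print(filename)
print(parse_name(filename))
```

This is also what makes the metadata recoverable from a bare filepath when a job carries nothing else.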
28. We have images!
• 580 GB of JPGs
• From dogfooding, a hybrid approach
seemed necessary:
• An online, shareable, linkable, easy-to-find
presence, with a unique ID per image.
• Easy mapping between local image and
online image.
30. Options
• Wikimedia Commons: we know about the
books, but have no idea about the actual
content! Wikimedia Commons wouldn't be
able to handle 1 million images in one go.
• Er... Flickr?
31. Upload by worker
• Again, a similar structure: each job was
simply a filepath (metadata deducible from it)
• Ran approximately 16-18 workers for 9
days to upload images.
• Upload success rate in the high 90s (percent),
depending on time of day
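With a success rate short of 100%, some uploads have to be attempted again. The slides do not say how this was handled, so the retry wrapper below is purely an assumption: `upload_with_retry`, its parameters and the injected `upload` callable are hypothetical, and no real Flickr call is made.

```python
import time


def upload_with_retry(filepath, upload, attempts=3, base_delay=1.0):
    """Call `upload(filepath)`, retrying with a growing pause on failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return upload(filepath)
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
            if attempt < attempts:
                time.sleep(base_delay * attempt)
    raise RuntimeError(f"gave up on {filepath}") from last_error


# Demo with a fake uploader that fails twice, then succeeds.
calls = {"n": 0}


def flaky_upload(filepath):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok:" + filepath


result = upload_with_retry("1850/example.jpg", flaky_upload, base_delay=0)
print(result, calls["n"])
```

A job that still fails after the last attempt raises, so the caller can push the filepath back onto the queue for a quieter time of day.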
32. Outcome
• Launched 13 December on Flickr
Commons
• Spike: 55 million image views in 5 days
• By March 2014, 70k+ tags added by
community: map, portrait, cover,
childrensbook, and so on.
33.
34. Keeping track
• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to
mostly work
35. Current scheme
• Every morning, call recentlyUpdated for
list of images that had some change
• Re-scan images and deduce changes in
tags, comments, views and favourites.
– (Same pattern, rescan jobs taken by
get_activity workers. Running 4 is enough
outside of spike times)
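The "deduce changes" step amounts to comparing the previous snapshot of an image with a fresh one. A minimal sketch follows, assuming each snapshot is a dict with a `tags` list and integer counters. That shape, the key names and the function name are illustrative, not the Flickr API's response format.

```python
def diff_snapshot(old, new):
    """Return only what changed between two snapshots of one image."""
    changes = {}

    old_tags = set(old.get("tags", ()))
    new_tags = set(new.get("tags", ()))
    if new_tags - old_tags:
        changes["tags_added"] = sorted(new_tags - old_tags)
    if old_tags - new_tags:
        changes["tags_removed"] = sorted(old_tags - new_tags)

    # Counters: report the difference since the last scan.
    for field in ("comments", "views", "favourites"):
        delta = new.get(field, 0) - old.get(field, 0)
        if delta:
            changes[field] = delta
    return changes


# Hypothetical snapshots from yesterday's and today's scans.
yesterday = {"tags": ["map"], "comments": 0, "views": 120, "favourites": 2}
today = {"tags": ["map", "portrait"], "comments": 1, "views": 180, "favourites": 2}

delta = diff_snapshot(yesterday, today)
print(delta)
```

An empty result means the image appeared in the updated list without any change this scheme tracks.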
36. Caching
• Redis sets:
– PeopleID links to set of FlickrID+tagadded
– FlickrID links to set of user tags
– Sorted sets for 'high score' lists:
contributors, favourites, tags
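The shape of that cache can be sketched with in-memory stand-ins: Python sets in place of Redis sets, and `Counter` in place of sorted sets (roughly SADD and ZINCRBY in Redis terms). The class, method and key names are hypothetical, and this sketch does not talk to Redis at all.

```python
from collections import Counter, defaultdict


class TagCache:
    """In-memory stand-in for the set and sorted-set layout on the slide."""

    def __init__(self):
        self.tags_by_image = defaultdict(set)   # FlickrID -> user tags
        self.adds_by_person = defaultdict(set)  # PeopleID -> (FlickrID, tag)
        self.contributor_scores = Counter()     # 'high score' list of people
        self.tag_scores = Counter()             # 'high score' list of tags

    def record_tag(self, person_id, flickr_id, tag):
        """Record one tagging event; return False if it was already known."""
        if tag in self.tags_by_image[flickr_id]:
            return False
        self.tags_by_image[flickr_id].add(tag)
        self.adds_by_person[person_id].add((flickr_id, tag))
        self.contributor_scores[person_id] += 1
        self.tag_scores[tag] += 1
        return True

    def top_contributors(self, n=10):
        return self.contributor_scores.most_common(n)

    def top_tags(self, n=10):
        return self.tag_scores.most_common(n)


# Demo with made-up IDs.
cache = TagCache()
cache.record_tag("alice", "img1", "map")
cache.record_tag("alice", "img2", "map")
cache.record_tag("bob", "img2", "portrait")
duplicate = cache.record_tag("bob", "img1", "map")  # already on img1

print(cache.top_contributors(1), cache.top_tags(1), duplicate)
```

Ignoring repeats keeps the scores stable when the same image is rescanned on successive mornings.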
37. Summary
• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.