A discussion of Text and Data Mining in science and at Springer Nature in particular. As presented at the Frankfurt Book Fair 2018 by Markus Kaindl, Senior Manager Semantic Data, Springer Nature.
Unleashing the Power of GPT & LLM: A Holland & Barrett ExplorationDobo Radichkov
Join Holland & Barrett's innovative journey into the world of Generative Pre-trained Transformers (GPT) and Large Language Models (LLM).
In this presentation, we delve into the promise these models hold for personal productivity and beyond. Our Data team spearheads this exploration, highlighting potential applications across diverse roles, from Editors to Engineers.
Discover how we're formulating best practices, developing guiding frameworks, and innovating for the future. Whether you're new to the world of GPT and LLM or an expert, gain valuable insights from our experiences at Holland & Barrett.
What is the reproducibility crisis in science and what can we do about it?Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
Using Data Strategy Design to Build Data-Driven ProductsDatentreiber
Everyone is talking about Big Data, Deep Learning and Artificial Intelligence. But the reality in some companies looks different, especially when developing new products: (the relevant) data is missing. Without predictive models and recommendation systems cannot be trained and the value is consequently low. This so called cold-start problem is especially concerning startups, since without own data treasure the companies are missing a defendable unique value proposition. Successful startups solve this problem with the help of „Data Traps“ and develop products with „Data Network Effects“. What exactly stands behind these terms and how companies design their own successful and data-driven products, will be demonstrated by Martin Szugat based on samples from his occupation as Data Strategy Consultant.
Unleashing the Power of GPT & LLM: A Holland & Barrett ExplorationDobo Radichkov
Join Holland & Barrett's innovative journey into the world of Generative Pre-trained Transformers (GPT) and Large Language Models (LLM).
In this presentation, we delve into the promise these models hold for personal productivity and beyond. Our Data team spearheads this exploration, highlighting potential applications across diverse roles, from Editors to Engineers.
Discover how we're formulating best practices, developing guiding frameworks, and innovating for the future. Whether you're new to the world of GPT and LLM or an expert, gain valuable insights from our experiences at Holland & Barrett.
What is the reproducibility crisis in science and what can we do about it?Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
Using Data Strategy Design to Build Data-Driven ProductsDatentreiber
Everyone is talking about Big Data, Deep Learning and Artificial Intelligence. But the reality in some companies looks different, especially when developing new products: (the relevant) data is missing. Without predictive models and recommendation systems cannot be trained and the value is consequently low. This so called cold-start problem is especially concerning startups, since without own data treasure the companies are missing a defendable unique value proposition. Successful startups solve this problem with the help of „Data Traps“ and develop products with „Data Network Effects“. What exactly stands behind these terms and how companies design their own successful and data-driven products, will be demonstrated by Martin Szugat based on samples from his occupation as Data Strategy Consultant.
Build Intelligent Fraud Prevention with Machine Learning and GraphsNeo4j
See how financial services, banking and retail are using graph-enhanced machine learning to thwart fraud. Fraudsters are becoming increasingly sophisticated, organized and adaptive; traditional, rule-based solutions are not broad or nimble enough to deal with this reality. This session will cover several demonstrations and real-world technical examples including preventing credit card fraud, identifying money laundering and reducing false positives.
In this deck I discuss why, and how data makes me tick. I've been doing work with data for decades but it's more exciting now than ever. Why do I get nerdy excitement from data? Because of my personal data, because data fuels change, because data is used for good. I'm immersed in data that has purpose on a daily basis.
Can you distinguish between data that has purpose and data that fuels vanity? Can you tell when data is for activation and when it's for reporting? Which data gets you promoted and which data gets you bogged down? A click bait teaser - you already know the answer...
Chris Attewell - Global Sales Director - Search Laboratory
The Google Analytics 360 Suite is designed to deliver deeper insights into when, why and how your customers buy from or engage with you, so you can make better marketing decisions based on accurate data. Chris will give a brief introduction to the Suite and why large companies are choosing Google Analytics 360.
Top contenders in the 2015 KDD cup include the team from DataRobot comprising Owen Zhang, #1 Ranked Kaggler and top Kagglers Xavier Contort and Sergey Yurgenson. Get an in-depth look as Xavier describes their approach. DataRobot allowed the team to focus on feature engineering by automating model training, hyperparameter tuning, and model blending - thus giving the team a firm advantage.
Calibration exercise on GA4 observed vs modelled vs GA 360 data. The methodology was flawed in this case but the results and approach stand up to scrutiny
Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. The rise of data science has accompanied a comparable rise in business intelligence and the demand for visualizations and dashboards that can explain models, summarize results, assist with decision making, and even predict outcomes. Ultimately, an analyst’s data science toolbox is incomplete without visualization skills. This talk will explore the landscape of visualization for data science – using visualization for data exploration and communication, reproducible approaches to visualization, and how to develop better instincts for visualization choice and graphic design.
GPT and Graph Data Science to power your Knowledge GraphNeo4j
In this workshop at Data Innovation Summit 2023, we demonstrated how you could learn from the network structure of a Knowledge Graph and use OpenAI’s GPT engine to populate and enhance your Knowledge Graph.
Key takeaways:
1. How Knowledge Graphs grow organically
2. How to deploy Graph Algorithms to learn from the topology of a graph
3. Integrate a Knowledge Graph with OpenAI’s GPT
4. Use Graph Node embeddings to feed Machine Learning workflow
Build Intelligent Fraud Prevention with Machine Learning and GraphsNeo4j
See how financial services, banking and retail are using graph-enhanced machine learning to thwart fraud. Fraudsters are becoming increasingly sophisticated, organized and adaptive; traditional, rule-based solutions are not broad or nimble enough to deal with this reality. This session will cover several demonstrations and real-world technical examples including preventing credit card fraud, identifying money laundering and reducing false positives.
In this deck I discuss why, and how data makes me tick. I've been doing work with data for decades but it's more exciting now than ever. Why do I get nerdy excitement from data? Because of my personal data, because data fuels change, because data is used for good. I'm immersed in data that has purpose on a daily basis.
Can you distinguish between data that has purpose and data that fuels vanity? Can you tell when data is for activation and when it's for reporting? Which data gets you promoted and which data gets you bogged down? A click bait teaser - you already know the answer...
Chris Attewell - Global Sales Director - Search Laboratory
The Google Analytics 360 Suite is designed to deliver deeper insights into when, why and how your customers buy from or engage with you, so you can make better marketing decisions based on accurate data. Chris will give a brief introduction to the Suite and why large companies are choosing Google Analytics 360.
Top contenders in the 2015 KDD cup include the team from DataRobot comprising Owen Zhang, #1 Ranked Kaggler and top Kagglers Xavier Contort and Sergey Yurgenson. Get an in-depth look as Xavier describes their approach. DataRobot allowed the team to focus on feature engineering by automating model training, hyperparameter tuning, and model blending - thus giving the team a firm advantage.
Calibration exercise on GA4 observed vs modelled vs GA 360 data. The methodology was flawed in this case but the results and approach stand up to scrutiny
Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. The rise of data science has accompanied a comparable rise in business intelligence and the demand for visualizations and dashboards that can explain models, summarize results, assist with decision making, and even predict outcomes. Ultimately, an analyst’s data science toolbox is incomplete without visualization skills. This talk will explore the landscape of visualization for data science – using visualization for data exploration and communication, reproducible approaches to visualization, and how to develop better instincts for visualization choice and graphic design.
GPT and Graph Data Science to power your Knowledge GraphNeo4j
In this workshop at Data Innovation Summit 2023, we demonstrated how you could learn from the network structure of a Knowledge Graph and use OpenAI’s GPT engine to populate and enhance your Knowledge Graph.
Key takeaways:
1. How Knowledge Graphs grow organically
2. How to deploy Graph Algorithms to learn from the topology of a graph
3. Integrate a Knowledge Graph with OpenAI’s GPT
4. Use Graph Node embeddings to feed Machine Learning workflow
Linda Treude, Sabine Wolf: Features for the Future Library #bcs2015KISK FF MU
Talk given at the BOBCATSSS 2015 conference - http://www.bobcatsss2015.com/.
In the contribution “Features for the Future Library” the German project “mylibrARy” will be introduced, which is a cooperation project between the University of Applied Sciences in Potsdam, a public library in Berlin and one of the leading AR-software companies metaio GmbH from Munich. The conceptual process of a library AR-app will be presented as well as the results of a user study, which might give an answer to the question, what features of an app the library users want.
Furthermore the possibilities of AR-technology for libraries in general will be discussed and contextualized within the concept of a modern user-friendly library.
QURATOR: A Flexible AI Platform for the Adaptive Analysis and Creative Genera...Georg Rehm
Georg Rehm. QURATOR: Developing a Flexible AI Platform for Digital Content Curation. QURATOR 2020 – Conference on Digital Curation Technologies., 1 2020. Fraunhofer FOKUS, January 20/21, 2020. Invited keynote talk.
Supporting Springer Nature Editors by means of Semantic TechnologiesFrancesco Osborne
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...Dr. Haxel Consult
Henning Schönenberger (SpringerNature, Germany)
Springer Nature SciGraph, the new Linked Open Data platform aggregating data sources from Springer Nature and key partners from the scholarly domain. The Linked Open Data platform will initially collate information from across the research landscape, such as funders, research projects, conferences, affiliations and publications. Additional data, such as citations, patents, clinical trials and usage numbers will follow over time. This high quality data from trusted and reliable sources provides a rich semantic description of how information is related, as well as enabling innovative visualizations of the scholarly domain.
A Corpus of Chinese Comic Books: Database, Metadata, and Visual Object Recogn...Matthias Arnold
Chinese comics mostly appeared in the form of small rectangular pocket-books (ca. 13 x 9 cm) with stories containing a picture on the top and a short narrative on the bottom. They are called “chain(ed) pictures” lianhuanhua 连环画 or, alternatively, “small people’s books” xiaorenshu 小人书. These have been the dominant form of comic popular in China since the 1910s and, at least until the late 1980s They introduced generations of Chinese to literature as well as propaganda texts. Especially popular in the 1960s and 1970s, they were the dominating popular medium in everyday life and helped to form a cultural memory. Since the 1990s they have almost completely disappeared and can only be found in private collections today. Academic interest only gradually evolves with researcher focusing on individual artists or publishing systematic genre overviews. The Heidelberg collection is an attempt to join some of the scattered holdings in Germany to form a basis for future research. It comprises the scans of over 1200 comic books, which are combined with systematic metadata recordings and a basic contents analysis through keywords. The focus of our endeavor was on comics from the second half of the Cultural Revolution and immediately thereafter.
In my presentation I will introduce the medium lianhuanhua, discuss aspects of metadata recording and database development, as well as report on an experimental approach to automatically detect visual objects in the images.
Workshop "Comics Annotation: Designing Common Frameworks for Empirical Research", Potsdam 2018-06-18/19
This presentation was provided by Markus Kaindl of Springer Nature, during the NISO event "Transforming Search: What the Information Community Can and Should Build." The virtual conference was held on August 26, 2020.
This presentation, hold during Semantcs conference, introduce Ontos' current achievement towards a Streaming-based Text Mining solution by using Deep Learning and Semantic Web technologies.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
1. Text and Data Mining
at Springer Nature
Markus Kaindl
Senior Manager Semantic Data
Frankfurt Book Fair
October 11th 2018
2. 1
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
1
Agenda (25 Min Talk + 5 Min Q&A)
Background
• General introduction to TDM
• Use cases in science and beyond
TDM @ SN
• Springer Nature TDM Framework
• APIs and query examples
Wrap Up
• Closing remarks
• Q&A session
3. 22
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Background
4. 33
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
General
introduction
to TDM
5. 4
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
4
Text and Data Mining (TDM) is the
(semi-)automated process of selecting
and analyzing large amounts of text or
data resources for purposes such as
• searching,
• finding patterns,
• discovering relationships,
• semantic analysis and
• learning how content relates to
ideas and needs
in a way that can provide valuable
information needed for studies,
research, etc.
What is Text and Data Mining?
6. 5
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
5
The variety of TDM techniques, practices and end-goals makes it virtually impossible to
provide a general and exhaustive illustration of how TDM works.
By means of a necessary simplification, it appears however possible to distinguish three
common – yet not all necessary – steps in TDM processes:
1. Access to content;
2. Extraction and/or copying of content;
3. Mining of text and/or data and knowledge discovery.
Different steps in the process of TDM
Text taken from:
Dr Eleonora Rosati, Associate Professor in Intellectual Property Law at the University of Southampton
Policy Department for Citizens' Rights and Constitutional Affairs; European Parliament; PE 604.942
7. 6
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Step 3 - Mining of text and/or data and knowledge discovery
Image taken from:
Dr Eleonora Rosati, Associate Professor in Intellectual Property Law at the University of Southampton
Policy Department for Citizens' Rights and Constitutional Affairs; European Parliament; PE 604.942
8. 77
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Use cases in
science and beyond
9. 8
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
8
• Searching and archiving
• Metadata display
• Knowledge retrieval
• Multisource library
• Online portals
• Apps, products and solutions
• Fraud detection (banking, insurance) and spam filtering
- Some build in our APIs into their local search facility to simply improve their
researchers’ discovery experience, some others do specific term or data mining.
- Some go very deep into semantic text mining, some others want to retrieve hidden
research data structures as part of given science programs.
- Some even may use TDM to investigate competitor activities and staff.
Text and Data Mining => Use Cases
10. 9
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
• User runs search on “Clinical
Orthopaedics and Related
Research” journal
portal, e. g. hip cartilage
• Though the user is
searching the portal
our API is being called
and the latest results are
returned on the fly.
Examplatory query on an external journal portal
11. 10
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Research Use Cases
12. 11
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
11
- Assist banks with traditional analyses of industries and sectors
- Serve marketing to visually track information spread and influence paths across
media segments in real-time
- In the field of criminology, it is believed that TDM can help exploring and detecting
crimes and their relationships with criminals, so to assist police forces.
- Support anthropologists in their studies of cultural phenomena by mining the ever-
growing content made available through social networks
- Please note that the same technology may perform TDM for different purposes:
For instance, IBM Watson Explorer has been used –among others– to:
- Improve productivity in the workplace
- Increase efficiency in public health management
- Improve diagnostic information
- Allow for tailored drug recommendations
- Prevent the commission of crimes, including cybercrimes and hacks
- Create new culinary recipes
TDM use cases not limited to the field of research
13. 12
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
12
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
TDM @ SN
14. 1313
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Springer Nature
TDM Framework
15. 14
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
14
Text and Data Mining => A TDM Framework for Springer Nature
https://www.springernature.com/text-and-data-mining
16. 15
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
15
• A growing part of Springer Nature’s journal articles is published open access.
• TDM is usually allowed without restrictions for these publications since the majority
of Springer Nature open access content is licensed under CC-BY.
Text and Data Mining
17. 16
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
16
• Springer Nature is participating in the Crossref TDM working group and is
recommending Crossref services for pan-publisher TDM.
• For further information see http://tdmsupport.crossref.org/
Text and Data Mining at CrossRef
18. 17
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
17
• For subscribers Springer Nature offers a large variety of TDM tools such as metadata
and fulltext APIs, applicable to both open access and subscribed resources.
• See https://api.springernature.com
Text and Data Mining for Subscribers
19. 18
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
18
• Non-subscribers are offered a variety of TDM tools for our Open Access resources,
such as our Open Access fulltext API (see https://api.springernature.com).
• TDM requirements from non-subscribers for pay-walled content are treated on a
case-by-case basis.
• Please contact tdm@springernature.com.
Text and Data Mining for Non-Subscribers
20. 19
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
19
• At the moment Springer Nature does not offer any API for image mining.
Image Mining
21. 20
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
20
• “Argumentation mining aims to automatically detect, classify and structure
argumentation in text (…) i.e. understanding the content of serial arguments, their
linguistic structure, the relationship between the preceding and following arguments
(…).” (Mochales, R. & Moens, MF. Artif Intell Law (2011) 19: 1.
https://doi.org/10.1007/s10506-010-9104-x)
• Argumentation mining can be considered as a subset of text mining. If you are
planning to locally store non-open-access content during an argumentation mining
project, please get in contact with tdm@springernature.com to discuss options.
Argumentation Mining
22. 2121
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
APIs and
query examples
23. 22
• metadata: general xml metadata using PAM standard (Prism Aggregate format); uses prism and
dc (dublincore) namespaces and elements. In use but fixed (i.e., no changes will be made).
• meta/v1: based on metadata API. Has PAM standard but additional elements added for various
business reasons. Fluid API to which changes can be easily made.
• openaccess: complete, “fulltext” (i.e., body) xml where available; when fulltext/body not
available, xml metadata is provided.
• integro: journal-level information. Based on SpringerLink journal xml.
• xmldata: JATS format for fulltext as available; else metadata.
• Throttle: recommended crawling rate of 150 hits/minute and 7500/day.
• Key: special key provided for TDM.
• All APIs have xml and json except for xmldata and integro.
• A full picture of Springer Nature’s API offerings with detailed information, examples and API key
sign-up can be accessed via https://api.springernature.com.
Springer Nature APIs
25. 24
• Citations APIs (internal)
• Article/chapter count: returns the number of citations for a given DOI.
• Book-level count: returns total count of citations for all the chapters in a book.
• Journal-level count: returns total count of citations for all the articles in a journal.
• Show citations: returns actual citations for a given DOI.
• Recent citations: provides recent citation data from Crossref.
Springer Nature APIs
26. 25
- 1 billion facts / 200 gb download size
- Licensing: CC0, CC-BY-NC & CC-BY
Metadata about:
- Journals & Articles (8M) + Abstracts
- Books & Chapters (4M)
- Grants (200k), Research Organizations,
Subjects, Conferences, Ontologies
SN SciGraph APIs: Linked Data API or
Redirect API; more details here
http://scigraph.springernature.com/explorer/api/
Linked Open Data Publishing So Far
http://scigraph.springernature.com
28. 27
• Example Queries (cont’d)
• Volume, issue:
http://api.springer.com/openaccess/jats?q=journal%3A%22CardioVascular%20and%20Interve
ntional%20Radiology%22%20volume%3A34%20issue%3A6&p=1&api_key=X
• Date: http://api.springer.com/metadata/pam?q=date:2010-03-01&api_key=X
• Issue type (meta, metadata):
http://api.springer.com/meta/v1/pam?q=journal%3A%22CardioVascular%20and%20Interventi
onal%20Radiology%22%20volume%3A37%20issue%3A2%20issuetype%3ARegular%20&api_ke
y=X
• Journalid (meta, metadata):
http://api.springer.com/meta/v1/pam?q=journalid%3A270%20heart%20%20sort%3Adate%20
%20&api_key=X
• Open Access (meta):
http://api.springer.com/meta/v1/pam?q=journalid:259%20openaccess:true&api_key=X
Springer Nature APIs => General Constraints
29. 28
near query
• The purpose of the “near” query is to find terms within x number of words from each other. In the
highlighted case, we use “engineering” which can have various meanings and contexts:
• The “engineering NEAR/3 locomotive” API call
http://api.springer.com/xmldata/jats?q=engineering%20NEAR/3%20locomotive&api_key=X&p=5
yields four results where “engineering” is within 3 words of “locomotive” which is what was specified in
the API call.
• The “engineering NEAR/5 metals” API call
http://api.springer.com/xmldata/jats?q=engineering%20NEAR/5%20metals&api_key=X&p=5
yields four results where “engineering” is within 5 words of “metals”.
32. 31
xmldata/fulltext API
• http://api.springer.com/xmldata/jats?q=doi:10.1007/s11258-006-9242-0&api_key=X
• http://api.springer.com/xmldata/jats?q=journalid:270%20year:2017&api_key=X&s=1&p=3
• http://api.springer.com/xmldata/jats?q=isbn:978-1-59259-559-4&api_key=X&s=1&p=3
• Returns fulltext JATS xml as available; for internal use (except as indicated elsewhere).
• Querying constraints
• doi
• journalid
• issn
• journal
• isbn
• book
• pub (publication)
• title
• name
• subject
• date
• year
• keyword
• type (Book or Journal)
• country
• JournalOnlineFirst (true or false)
• Orgname
• empty (no constraint, general
search)
33. 32
Open Access API
• http://api.springer.com/openaccess/jats?q=issn:1432-086X&api_key=X
• Returns fulltext JATS xml for Open Access articles and chapters
34. 33
Citations APIs (for Journal Articles)
• Article/chapter count: returns the number of citations for a given DOI
• http://citationsapi.nkb3.org/article/citations/count?doi=10.1007/s00216-007-1222-2
• Show citations: returns actual citations for a given DOI.
• http://citationsapi.nkb3.org/article/citations?doi=10.1007/s00216-007-1222-2
• Data for citations APIs comes from
the responses to requests we make
to Crossref to find out who is citing
our data.
35. 34
Citations APIs (for Book Chapters and Journals)
• Book-level count: returns total count of citations for all the chapters in a book.
• http://citationsapi.springerservice.com/book/citations/count?doi=10.1007/978-3-642-24040-9
• Journal-level count: returns total count of citations for all the articles in a journal.
• http://citationsapi.springerservice.com/journal/367/citations/count
• Citations Combination Count and Display API:
• http://citationsapi.springerservice.com/book/citations?doi=10.1007/978-3-540-30299-
5&s=51&p=20
36. 35
Citations APIs
• Recent citations: live feed receiving daily Crossref data; not just about recent Springer Nature
articles; this is about ANY new article citing (/citation) a Springer Nature article (/item) from
any time.
• http://citations.springer.com/xml/recent_citations/2015-02-03?start=1&end=10
• http://citations.springer.com/xml/recent_citations/2015-02-03?maxResults=10
37. 3636
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Wrap Up
38. 3737
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
Closing remarks
39. 38
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
38
Springer Nature Publications About TDM
40. 39
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
39
• Springer Nature also offers direct metadata delivery options in various formats, such
as ONIX or MARC records, using different protocols (ftp/ftps, sftp) including the
Open Archives Initiative for metadata harvesting (OAI-PMH).
• For direct metadata download features via the metadata downloader see
https://metadata.springernature.com.
• If you have requests for metadata delivery please contact
dds.support@springernature.com.
Metadata Delivery
41. 40
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
40
Metadata Delivery
42. 42
Text and Data Mining at Springer Nature | Markus Kaindl | Frankfurt Book Fair | October 2018
42
Information professionals, knowledge workers and
librarians have a long familiarity with large sets of
information. What has changed is the development of
sophisticated tools for text- and data-mining (TDM) of
disparate data sets.
In this white paper, Mary Ellen Bates will offer insights
into how “info pros” can raise their role in TDM projects
within their organizations by better appreciation for
what aspects of TDM most benefit from a data
scientist’s point of view and how “info pros” can best
leverage their professional expertise in this new field.
Whitepaper: Bringing Insight to Data:
Info Professionals’ Role in Text and Data Mining
43. Dr. Michele Cristovao
Project and Innovation Manager
Database Research Group
Applications of Artificial Intelligence in
Life Sciences and Nanotech research
11th of October – 3pm, right here!