An introductory course for librarians on applying Big Data and Data Science in the field of Library Science. The course is a 2-hour module covering the fundamentals of applying data science (DS) work.
4. Program Objectives / Program Goals
Participants will be able to relate Big Data and Data Science
applications to library services.
5. 1. What is Big Data?
Extremely large data sets that may be analyzed to reveal
patterns, trends and associations
6. The BIG 3 V’s
• Variety: different types of data
(Facebook, Twitter, CCTV feed)
• Velocity: the speed that data comes in
(batch, streaming every second)
• Volume: the sheer size of the data
(1GB, 1TB, 1PB, 1ZB)
7. Library Data Resources
What resources does the library have (budget, staff, premises, media, opening
hours etc.) and how is the library performing against traditional parameters,
like lending figures, visitors and social media activity? This library data can
also be combined with environmental information like community education
levels, geographical distances, age and so on.
http://www.axiell.co.uk/getting-the-most-from-your-library-data/
8. DATA Analytics Challenges and Pitfalls
The challenges to creating a robust institutional data analytics program include
culture, talent, cost, and data. We have deliberately mentioned culture first
because it is very easy to jump to data challenges. In fact, most of the
literature surrounding data analytics starts with challenges surrounding the
data itself. However, we are convinced that institutional culture is the most
important factor in determining the success of any given data analytics
program, including the politics and process around questions of talent, cost,
and data itself.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Analytics:
Challenges and Opportunities
63% of researchers and administrators expressed unhappiness with the use
of metrics in higher education (Abbott et al., 2010)
9. What about New Tasks like streamlining for the Librarian?
If librarians take on new tasks, it is very important to track the amount of
time and level of staff required when undertaking analytics projects.
For example, collecting citation data for a researcher with a common
name often requires manual and painstaking record-by-record
searching in order to disambiguate that individual's research from
others that share his/her name. This type of work requires a librarian
with a deep and intimate knowledge of the bibliometric databases that
are being used to harvest the bibliometric data.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Analytics:
Challenges and Opportunities
10. What is the Cost?
• Data analytics should be thought of as a strategic
investment, not a cost-saving technique
• The real cost is the time spent on cultural change and on
developing and educating a staff with the analytical skills
that we need in our discipline
• A visionary analytics plan invests in people, in hiring and
training, over data tools and platforms.
11. Pitfalls of Data Sharing:
Challenges on Institutional Data Analytics
Pitfalls and Possible Solution/s
• Ownership: who owns the data? It could be the registrar, library, or IT services.
  Possible solution: An assigned office, e.g. the Office of the President or the Compliance Office, can release the official reports.
• Quality: deciding when it is accurate or good data; data reliability.
  Possible solution: A Data Governance Unit assures the quality of data.
• Standards: what kinds of data variables are in use (string, numeric)?
  Possible solution: This can be addressed by Data Management through data warehousing.
• Access: who has access to the data?
  Possible solution: User roles can be defined as to who has access.
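The idea of defining user roles for data access can be illustrated with a minimal Python sketch. The role names and actions below are hypothetical and only show the general shape of a role-to-permission mapping, not any particular institution's policy.

```python
# Hypothetical roles and the actions each one may perform on institutional data.
ROLE_PERMISSIONS = {
    "data_governance": {"read", "write", "release_report"},
    "librarian": {"read"},
    "student_assistant": set(),
}

def can(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get no access."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(can("librarian", "read"))
print(can("librarian", "release_report"))
```

Denying access by default for unknown roles keeps the sketch on the safe side when a new role has not yet been reviewed.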
12. Getting Started on Institutional Data
• Creating an inventory of institutional data
• Developing a data dictionary
• Designing an unambiguous process for cleaning up those data
• Creating an open data set that answers the most commonly asked
data questions across campus.
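The data dictionary and clean-up steps above might look like the following Python sketch. The field names, owners, and the cleaning rule are hypothetical examples chosen for illustration, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class FieldDefinition:
    """One entry in a data dictionary (all example values are hypothetical)."""
    name: str
    dtype: type
    owner: str
    description: str

# A tiny, made-up data dictionary for library loan records.
DATA_DICTIONARY = [
    FieldDefinition("patron_id", str, "Registrar", "Unique patron identifier"),
    FieldDefinition("loans", int, "Library", "Items borrowed in the term"),
]

def clean_record(raw: dict) -> dict:
    """Keep only documented fields, strip whitespace, and coerce to the declared type."""
    cleaned = {}
    for definition in DATA_DICTIONARY:
        value = raw[definition.name]
        if isinstance(value, str):
            value = value.strip()
        cleaned[definition.name] = definition.dtype(value)
    return cleaned

record = clean_record({"patron_id": " A-001 ", "loans": "12", "note": "undocumented"})
print(record)
```

Because the cleaning step is driven by the dictionary itself, any field that has not been documented (such as `note` here) is dropped rather than silently passed along.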
13. Opportunities for Libraries on Big Data
• Libraries know metadata
• Libraries know strategy
• Libraries know assessment
• Libraries are neutral
• Libraries know the vendors
• Libraries are part of larger bodies like PAARL
• Libraries have influence over campuses
• Libraries know metrics
• Libraries have user-centered culture
• Libraries know the politics and policy issues with commercial parties
• Libraries collaborate with both academic and academic support units
14. 2. Building a BIG DATA culture
• Openness to and acceptance of technology: Upper Management
• Willingness to invest in the Big Data Platform: which entails cost
• Training Staff and making sure of job security: Skills upgrade
• Make data sharing acceptable: Trust in the data quality and people
• Create Data Quality Assurance Team/s
• Foster collaboration among departments
• Continuous improvement of models
15. DATA Governance and DATA Management are different roles
Data governance is the designation of decision-rights and policy-making surrounding institutional data,
while data management is the implementation of those decisions and policies. Institutions need both,
and both require investment, but the senior leadership of our institutions needs to design the former.
• Data Governance Council: policies, metrics
• Data Management: Data Quality Dept, Data Warehouse / Data Lake
16. Machine Learning
Is a type of artificial intelligence that provides
computers with the ability to learn without being
explicitly programmed.
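As a small illustration of "learning without being explicitly programmed", the Python sketch below fits a straight line to example data instead of hard-coding a rule. The visitor and loan figures are invented for illustration, and ordinary least squares is used only as one simple example of a learning method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training examples: daily visitors and items borrowed.
visitors = [100, 200, 300, 400]
loans = [52, 98, 151, 199]

# The relationship is learned from the examples, not written by hand.
slope, intercept = fit_line(visitors, loans)
predicted_loans = slope * 500 + intercept
print(round(predicted_loans, 1))
```

Nothing in the code states how many loans to expect per visitor; that relationship comes entirely from the examples supplied.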
18. Weather related information and reading a book (use of hash tags and location and weather data)
Pic from Marco Rasos
19. Social Listening – is the process of monitoring digital conversations to
understand what customers are saying about a brand or service.
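A very small form of social listening is counting which hashtags appear in posts that mention the library. The sketch below uses only the Python standard library; the posts are invented, and a real project would pull them from a social media platform under its own terms of use.

```python
import re
from collections import Counter

def count_hashtags(posts):
    """Count how often each hashtag appears across posts, ignoring case."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in re.findall(r"#\w+", post))
    return counts

# Hypothetical posts, for illustration only.
posts = [
    "Rainy day, perfect for reading at the #library #books",
    "New study rooms at the #Library are great",
    "Long queue at the returns desk today #library",
]

hashtag_counts = count_hashtags(posts)
print(hashtag_counts.most_common(2))
```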
20. Online Research Journals and Click through Rates
Click-Through Rate (CTR)
The ratio of users who click on a specific
link, ad, or button to the total number of
users who view the page that contains it.
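Click-through rate is commonly computed as clicks divided by impressions (views). A minimal Python sketch, with made-up numbers for a hypothetical journal link:

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks divided by impressions, or 0.0 when the link was never shown."""
    if clicks < 0 or impressions < 0:
        raise ValueError("clicks and impressions must be non-negative")
    if impressions == 0:
        return 0.0
    return clicks / impressions

# Hypothetical figures: a journal link shown 1,500 times and clicked 45 times.
ctr = click_through_rate(clicks=45, impressions=1500)
print(f"{ctr:.1%}")
```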
22. Modern Day Data Scientists
• Dr. Reina Reyes, Astrophysicist
• Andrew Ng of Baidu, Coursera
• Amy Smith, Uber Singapore
• Isaac Reyes, Data Scientist, Talas Data Scientists
Data Science Conference 2016
YOU as the next Doctor Strange (entering the world of Data Science)
23. CRISP-DM Methodology
CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
The project was led by five companies: SPSS, Teradata, Daimler AG,
NCR Corporation, and OHRA, an insurance company.
25. From regular data to BIG data, from stat to AI
Regular data → BIG data
Statistical modeling → Machine Learning → Deep Learning / A.I.
Traditional → Modern
26. Trends in Data Science Domains
Data Science Domain: Current Status
• Natural Language Processing (NLP): Entered the market
• Predictive Analytics / Machine Learning: Entered the market
• Visualization / Dashboards: Entered the market
• Image Processing (OpenCV): Exploration
• Internet of Things (IoT): Exploration
• Artificial Intelligence: Exploration
27. DS/Big Data Applications to the field of Study
Sample Field of Study: Specific Applications
• Agriculture: Climate forecast modeling to help farmers manage plantations (e.g. corn yields)
• Medical field: Image processing for chest x-rays, retina images for diabetic patients
• Linguistics: Natural Language Processing (NLP) for dialects and Sentiment Analysis applications
• Economics/Finance: Predicting a stock price based on certain indicators (e.g. noise, competitor price)
• Engineering: Internet of Things (IoT) application to Big Data
28. Building a Data Science Team
Data Science Team Composition
• Data Scientist
  Programming languages: R, Python, Spark ML
  Sample output: Neural Nets, Random Forest
• Data Engineer / DevOps
  Programming languages: Hadoop, Spark Core, Spark Streaming
  Sample output: RDD, dataframes, SQLContext
• Statistician
  Programming languages: SAS, SPSS, R, Matlab
  Sample output: Linear Regression, K-means clustering
• Viz Expert
  Programming languages: Tableau, Cognos, D3, JavaScript
  Sample output: visualization, GIS maps
30. TOOLS: OPEN SOURCE vs PROPRIETARY SOFTWARE
• Open source
  Pros: No cost on software; packages are available faster
  Cons: Takes some time to create and integrate with other software
  Tools: Python, R, Apache Spark
• Proprietary software
  Pros: Easy to deploy
  Cons: Expensive software; you have to buy in modules
  Tools: SAS, IBM-SPSS, AWS, Google
31. Small Data vs Big Data (in comparison)
• Sampling
  Small data: a sample can be drawn (e.g. a survey)
  Big data: uses all of the data in storage
• Computing
  Small data: no need for in-memory computing; can be run on a regular PC/Mac
  Big data: eats up memory and needs distributed computing
• Statistics
  Small data: the usual statistical assumptions (normality, homoskedasticity, independence) can be checked and relied on
  Big data: significance tests such as p-values become less informative because the data is so large (a difference that is not significant in a small set becomes significant in a very large one, so be careful when relying on them)
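The caution about p-values can be illustrated numerically. The Python sketch below assumes a simple one-sample z-test with a known standard deviation (a simplification chosen for illustration) and keeps the same tiny, hypothetical difference while only the sample size changes.

```python
import math

def two_sided_p_value(mean_diff: float, sd: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test with known standard deviation."""
    z = mean_diff / (sd / math.sqrt(n))
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical: a practically negligible difference of 0.01 with sd = 1.
p_small_sample = two_sided_p_value(0.01, 1.0, n=100)
p_big_sample = two_sided_p_value(0.01, 1.0, n=1_000_000)

print(f"n=100:       p = {p_small_sample:.3f}")
print(f"n=1,000,000: p = {p_big_sample:.3g}")
```

The difference itself never changes; only the sample size does. With very large data it is therefore worth looking at the size of an effect, not just whether it is statistically significant.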
32. Simple DS Cheat sheet
• Classifiers: Neural Nets, Random Forest
• Clustering: K-means, Hierarchical Clustering
• Association: Association Rules
• Predicting: Linear Regression, Logistic Regression (binary)
• Medical: Cox Regression (survival), SVM (cancer cells)
35. Local Implications: Data Privacy Act of 2012 (Republic Act No. 10173)
Sensitive personal information refers to personal information:
1. About an individual’s race, ethnic origin, marital status, age, color, and religious, philosophical or
political affiliations;
2. About an individual’s health, education, genetic or sexual life of a person, or to any proceeding for
any offense committed or alleged to have been committed by such individual, the disposal of such
proceedings, or the sentence of any court in such proceedings;
3. Issued by government agencies peculiar to an individual which includes, but is not limited to, social
security numbers, previous or current health records, licenses or its denials, suspension or revocation,
and tax returns; and
4. Specifically established by an executive order or an act of Congress to be kept classified.
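One practical use of this list is to flag which columns of a patron record would count as sensitive before the data is shared. The Python sketch below is an illustration only: the field names are hypothetical, the set covers just some of the categories listed above, and it is not a compliance tool.

```python
# Hypothetical field names mapped from some of the categories listed above.
SENSITIVE_FIELDS = {
    "race", "ethnic_origin", "marital_status", "age", "religion",
    "health", "education", "social_security_number", "tax_return",
}

def split_sensitive(record: dict):
    """Split a record into (regular, sensitive) parts by field name."""
    sensitive = {k: v for k, v in record.items() if k in SENSITIVE_FIELDS}
    regular = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    return regular, sensitive

# Made-up patron record for illustration.
patron = {"patron_id": "A-001", "loans": 12, "age": 34, "religion": "undisclosed"}
regular, sensitive = split_sensitive(patron)
print(sorted(sensitive))
```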
36. Solutions to the Data Privacy Act: Policies
Make sure you have the following in place
• Opt In for customers
• Opt out for customers
• Update your customer policy accordingly
• Make your policy available publicly e.g. websites