The document covers data exploration and cleaning for the Coursera Data Science Capstone project. It covers checking data integrity by comparing file signatures, understanding data types such as text that require preprocessing, and demonstrating text preprocessing techniques in R and in other tools such as Python's NLTK. The goal is to understand the data and prepare it for building a predictive model by applying natural language processing concepts. Faster alternatives to the default R packages are also presented.
IDA MOOC Coursera Data Science Capstone (Data Cleaning/Data Exploration)
6. That’s when I thought nothing is as it seems.No tomorrow can be predicted accurately .We can just
plan,hope,wish or pray for things but then there is a higher power in action who makes everyday different
for every person so that he/she can learn ,grow and live to be a better person.
The Atlantic Yards development is supported by a wide range of elected officials, unions, community
leaders, issue advocates, urban development experts, religious leaders and organizations, local businesses
and thousands of fans all across Brooklyn and the New York area.
In this gauntlet of openness to new ways of thinking, Mercury makes an action-demanding link to Pluto the
transformer. Caution goes out the window. Raw truths, even secrets, come to the surface. Game-changing
information pops out. Thinking changes, permanently. Certain concepts aren’t going along for the ride any
longer. The cumulative result is streamlining, liberating, motivating and energizing.
We’ll be selling soda at the county fair this Sunday afternoon as part of their fundraiser.
Jim was a poet and a musician. He wrote some beautiful pieces and I’ve always been drawn to him since I
heard the Doors for the first time at an old boyfriend’s house. I’ve always be
en inspired to express myself both with lyrics and stories or visually with photographs and
drawings.Friend, let’s stand before God often and ask Him to reveal things in our life that are not
pleasing to Him. As we do that, the world will not have any stones to throw. They will be drawn to Jesus
and living their life for Him just like we should.
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
If you have an alternative argument, let's hear it! :)
If I were a bear,
Other friends have similar stories, of how they were treated brusquely by Laurelwood staff, and as often
as not, the same names keep coming up. About a half-dozen friends of mine refuse to step foot in there
ever again because of it. How many others they’re telling - and keeping away - one can only guess.
9. That’s when I thought nothing is as it seems.No
tomorrow can be predicted accurately .We can just
plan,hope,wish or pray for things but then there is a
higher power in action who makes everyday different
for every person so that he/she can learn ,grow and
live to be a better person.
The Atlantic Yards development is supported by a wide
range of elected officials, unions, community
leaders, issue advocates, urban development experts,
religious leaders and organizations, local businesses
and thousands of fans all across Brooklyn and the New
York area.
In this gauntlet of openness to new ways of thinking,
Mercury makes an action-demanding link to Pluto the
transformer. Caution goes out the window. Raw truths,
even secrets, come to the surface. Game-changing
information pops out. Thinking changes, permanently.
Certain concepts aren’t going along for the ride any
longer. The cumulative result is streamlining,
liberating, motivating and energizing.
We’ll be selling soda at the county fair this Sunday
afternoon as part of their fundraiser.
Jim was a poet and a musician. He wrote some
beautiful pieces and I’ve always been drawn to him
since I heard the Doors for the first time at an old
boyfriend’s house. I’ve always be
12. Differences between previous courses and capstone
Tabular data (previous courses)
• Cleaning tabular data typically means you
  • fill missing values with the mean/median/mode
  • remove columns with no variance
  • remove columns that are 100% unique (e.g. id)
  • select, combine or compute new columns as features
Text data (Capstone)
• Applying the concepts (previous slides) affects your data and the features for the model
• Which are the relevant concepts?
  • Stemming?
  • Part-of-speech tagging?
  • Stopwords?
  • Edit distance?
13. Are these two sentences the same?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
14. Getting data and verifying data integrity
• Getting data seems trivial and easy: just click Download
• How do you ensure that what you got is what was published?
• Compare the file signature, expressed as a hash (MD5, SHA1, etc)
• MD5 and SHA1 are hashing algorithms that convert content into a fixed-length string of hexadecimal digits
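The demo for this deck is written in R; as a rough equivalent, the check could be sketched in Python with the standard `hashlib` module. The function names `file_digest` and `verify_download` are hypothetical helpers introduced here, and the expected hash is assumed to come from whoever published the file.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_download(path, expected_hex, algorithm="sha1"):
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A call such as `verify_download("some_file.zip", published_hash)` returns `False` if even a single byte differs, which is the point of checking the signature.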
15. File Integrity and Content Check
Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/file_integrity_demo.R
16. Take home message from the demo
• It is good practice to check your file signature
• Something that means the same in our interpretation may not be the same to a computer.
• This will affect your work and its outcome
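To make the second point concrete, here is a small Python sketch (not the author's R demo) that compares the two sentences from slide 13 and lists the characters that fall outside ASCII. The helper name `non_ascii_chars` is hypothetical.

```python
import unicodedata

# The two sentences from slide 13; the first uses typographic punctuation,
# written here as escapes so the difference is explicit.
curly = ("There is \u201csomething\u201d going on right now. "
         "Like right\u2026 now. He\u2019s there.")
plain = "There is \"something\" going on right now. Like right... now. He's there."


def non_ascii_chars(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text) if ord(ch) > 127]


print(curly == plain)  # the strings are not equal to the computer
for index, char, name in non_ascii_chars(curly):
    print(index, char, name)
```

The listing names the left and right double quotation marks, the horizontal ellipsis and the right single quotation mark, none of which appear in the second sentence.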
17. Understanding the data
That’s when I thought nothing is as it seems.No tomorrow can be
predicted accurately .We can just plan,hope,wish or pray for things
but then there is a higher power in action who makes everyday
different for every person so that he/she can learn ,grow and live to
be a better person.
The Atlantic Yards development is supported by a wide range of
elected officials, unions, community leaders, issue advocates, urban
development experts, religious leaders and organizations, local
businesses and thousands of fans all across Brooklyn and the New York
area.
In this gauntlet of openness to new ways of thinking, Mercury makes
an action-demanding link to Pluto the transformer. Caution goes out
the window. Raw truths, even secrets, come to the surface. Game-
changing information pops out. Thinking changes, permanently. Certain
concepts aren’t going along for the ride any longer. The cumulative
result is streamlining, liberating, motivating and energizing.
We’ll be selling soda at the county fair this Sunday afternoon as
part of their fundraiser.
Jim was a poet and a musician. He wrote some beautiful pieces and
I’ve always been drawn to him since I heard the Doors for the first
time at an old boyfriend’s house. I’ve always be
en inspired to express myself both with lyrics and stories or
visually with photographs and drawings.Friend, let’s stand before God
often and ask Him to reveal things in our life that are not pleasing
to Him. As we do that, the world will not have any stones to throw.
They will be drawn to Jesus and living their life for Him just like
we should.
• Blogs, news and tweets are modern web content
• Content is not just letters, numbers and punctuation but also
  • Emoticons 😬😀
  • Platform-specific terms: @mention, #hashtag, stock symbols
18. Dealing with content
• One of the tasks in the Capstone is to remove profanity
• Several approaches (applies to other things as well)
1. Remove that word only
2. Remove the line containing the word
3. Replace the word with a placeholder
• The key question to ask is: How will my action affect the outcome?
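The three approaches can be sketched in Python as follows. This is an illustration only: the word list `BLOCKLIST` holds mild placeholder words, and the function names are hypothetical rather than taken from the Capstone material.

```python
import re

# Hypothetical placeholder list; a real project would load a proper word list.
BLOCKLIST = {"darn", "heck"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in sorted(BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)


def remove_word(line):
    """Approach 1: delete only the offending word, then tidy the whitespace."""
    return re.sub(r"\s+", " ", _PATTERN.sub("", line)).strip()


def drop_lines(lines):
    """Approach 2: discard every line that contains an offending word."""
    return [line for line in lines if not _PATTERN.search(line)]


def replace_word(line, placeholder="<PROFANITY>"):
    """Approach 3: swap the offending word for a placeholder token."""
    return _PATTERN.sub(placeholder, line)
```

The choice matters for n-grams: removing only the word from "well darn it" leaves "well it", a bigram the author never wrote, whereas a placeholder keeps the word boundary visible and dropping the line loses the surrounding clean text.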
19. Understanding what your tokeniser does
• Tokenisers come from NLP, RWeka or other packages
• RWeka uses Weka: http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html
• Different tokenisers behave differently, and thus produce different outcomes.
21. Bigrams from NLP
"something" going 1
“something” going 1
going on 2
he's there. 1
he’s there. 1
is "something" 1
is “something” 1
like right... 1
like right… 1
now, like 2
now. he's 1
now. he’s 1
on right 2
right now, 2
right... now. 1
right… now. 1
there is 2
Bigrams from RWeka
going on 1
he s 1
is something 1
like right 1
now he 1
now like 1
on right 1
right now 2
s there 1
something going 1
there is 1
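The difference between the two columns can be imitated in plain Python, as a sketch rather than a reimplementation of either R package. One function splits on whitespace only; the other splits on a fixed set of ASCII delimiters, which is assumed here to resemble the default delimiter set of Weka's NGramTokenizer. All function names are hypothetical.

```python
import re
from collections import Counter

plain = "There is \"something\" going on right now. Like right... now. He's there."
curly = ("There is \u201csomething\u201d going on right now. "
         "Like right\u2026 now. He\u2019s there.")


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to the words."""
    return text.lower().split()


def delimiter_tokens(text, delimiters=" \r\n\t.,;:'\"()?!"):
    """Split on a set of ASCII delimiters (an assumed, Weka-like default)."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text.lower()) if token]


for name, tokenise in [("whitespace", whitespace_tokens), ("delimiter", delimiter_tokens)]:
    print(name, Counter(bigrams(tokenise(plain))).most_common(3))
```

With the delimiter-based splitter, `He's` becomes `he` and `s`, and `right now` is counted twice because `right... now.` loses its punctuation. The typographic apostrophe in the other sentence is not in the delimiter set, so `he’s` survives as a single token there.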
22. What happened?
• These 2 sentences are the same to us, but not to the computer.
• Modern web content comes in different character encodings.
• The double quotation marks (left and right), the apostrophe and the ellipsis in the first sentence are non-ASCII Unicode characters; the second sentence uses their ASCII look-alikes.
• Deal with them as previously mentioned: remove or replace.
• But consider: do you want to predict what your user types as “something”, as something, or both?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
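The "replace" option can be sketched in Python with a translation table. The mapping below covers only the characters seen in this example and is an assumption, not a complete list; `to_ascii_punctuation` is a hypothetical helper name.

```python
# Map typographic punctuation to ASCII look-alikes (illustrative, not exhaustive).
REPLACEMENTS = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u2026": "...",  # horizontal ellipsis
})


def to_ascii_punctuation(text):
    """Replace typographic punctuation so both sentence variants become identical."""
    return text.translate(REPLACEMENTS)


curly = ("There is \u201csomething\u201d going on right now. "
         "Like right\u2026 now. He\u2019s there.")
plain = "There is \"something\" going on right now. Like right... now. He's there."
print(to_ascii_punctuation(curly) == plain)
```

Whether to apply such a mapping is the modelling decision the slide raises: after replacement the model can no longer distinguish the two ways of typing the same sentence.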
23. Post-processing
• Why is post-processing needed?
  • Tokenisers are limited in capability
  • To get cleaner data
something going
going on
he's there
is something
like right
on right
right now
there is
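One possible post-processing rule set that yields a list like the one above is sketched below in Python. The rules are assumptions chosen for illustration: drop bigrams whose first token ends a sentence or clause, then strip punctuation from the edges of each token. The slides do not state which rules the author applied.

```python
import string
from collections import Counter

# Characters stripped from token edges, and characters that mark a boundary.
EDGE_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019\u2026"
BOUNDARY_CHARS = ".,;:!?\u2026"


def clean_bigrams(tokens):
    """Count bigrams, skipping those that span a sentence or clause boundary."""
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left and left[-1] in BOUNDARY_CHARS:
            continue  # e.g. ("now.", "like") crosses a sentence boundary
        left_clean, right_clean = left.strip(EDGE_CHARS), right.strip(EDGE_CHARS)
        if left_clean and right_clean:
            counts[(left_clean, right_clean)] += 1
    return counts


sentence = "There is \"something\" going on right now. Like right... now. He's there."
for pair in sorted(clean_bigrams(sentence.lower().split())):
    print(" ".join(pair))
```

Because only the edges are stripped, the apostrophe inside `he's` is preserved, while the quotation marks around `something` and the trailing full stops are removed.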
25. Alternative Method
• Why?
  • R is slow
  • There is always another way to do the same thing, faster. This allows you to fail fast, learn more and explore your data.
• For those with programming experience in languages other than R, and anyone interested
  • Unix commands (grep, wc, awk, etc)
  • Python
    • NLTK (Natural Language Toolkit)
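As a sketch of the streaming style that makes tools like wc and grep fast, the Python function below reads a file line by line without loading it into memory. The function name `file_stats` and the returned keys are hypothetical, and the file is assumed to be UTF-8 text.

```python
import re


def file_stats(path, pattern=None, encoding="utf-8"):
    """Stream a text file once and collect wc/grep-style statistics."""
    regex = re.compile(pattern) if pattern else None
    line_count = 0
    longest_line = 0
    matching_lines = 0
    # errors="replace" keeps the scan going if a byte sequence is not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            text = line.rstrip("\n")
            line_count += 1                               # like `wc -l`
            longest_line = max(longest_line, len(text))   # longest line in characters
            if regex and regex.search(text):
                matching_lines += 1                       # like `grep -c pattern`
    return {"lines": line_count, "longest": longest_line, "matches": matching_lines}
```

A call such as `file_stats("some_text_file.txt", pattern=r"love")` makes a single pass over the file, so memory use stays flat regardless of file size.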
26. Benefits?
• Processed 100% of all 3 text files within 16 GB RAM
  • Caveat: do you need all the data for a representative model?
• Processing time is about 10 to 30 minutes for each text file depending on tokenisation length
• Failing fast allows you to discover your data
[Bar chart: Time taken to do Quiz 1 (sec), comparing *nix CLI and R; axis from 0 to 40]
27. Python NLTK Bigram Tokeniser
• Using cleaned data (Unicode characters handled)
• Capital letters, end-of-sentence and clause details are all preserved.
• Post-processing can be done to filter unigrams, invalid bigrams, etc.
There is 1
on right 1
... now 1
He 's 1
like right 1
`` something 1
right ... 1
, like 1
now , 1
something '' 1
now . 1
there . 1
' going 1
going on 1
right now 1
s there 1
is `` 1