1. CSV-X: A Linked Data Enabled
Schema Language, Model,
and Processing Engine
for Non-Uniform CSV
Graduate School of Interdisciplinary Information Studies
University of Tokyo
2. The Era of Data
• Big data comes from
• User-generated content (blogs, YouTube, Pinterest, etc.)
• Social Networks
• Open Government Data (data.gov, etc.)
• Mobile, Internet of Things, Sensors
6. Status of Today’s
Open Data Formats
• The most popular data formats in the world's open data are still
tabular: xls and csv/tsv. On data.gov.uk, over 90% are tabular.
• However, CSV is a very limited data format: no formal data
structure, no datatypes, no schema.
• Why is it still so widely used?
• Easy to produce from existing tools (relational DBs, Excel, etc.)
• Creating XML or RDF “costs” more, technically and
financially (even for US and UK government units)
7. • There is a W3C standard, and many tools are
trying to upgrade CSV
• However, most of the tools only support CSV as
defined by the IETF's RFC 4180 memo
• In an ODI study of all CSV files on data.gov.uk,
more than 40% were ignored
ID Name Address Remark
1, John Doe, “Main St.”,
2, Clark Kent, 5th Ave., “foo CRLF
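The kind of file that RFC 4180 tools do handle can be sketched as follows. This is a minimal illustration using Python's standard `csv` module, not part of CSV-X. The sample data is hypothetical (the text after `foo` is invented for the example), and the `is_uniform` helper is an assumed, simplified notion of "uniform": every row has as many fields as the header.

```python
import csv
import io

# Hypothetical sample: RFC 4180-style CSV in which one quoted field
# contains an embedded CRLF line break.
SAMPLE = (
    'ID,Name,Address,Remark\r\n'
    '1,John Doe,"Main St.",\r\n'
    '2,Clark Kent,5th Ave.,"foo\r\nbar"\r\n'
)


def read_rows(text):
    """Parse CSV text with the standard csv module (default dialect)."""
    # newline='' leaves line breaks inside quoted fields untouched.
    return list(csv.reader(io.StringIO(text, newline='')))


def is_uniform(rows):
    """Return True if every row has as many fields as the first row."""
    if not rows:
        return True
    width = len(rows[0])
    return all(len(row) == width for row in rows)


rows = read_rows(SAMPLE)
print(len(rows), is_uniform(rows))
```

A file that mixes several tables, title rows, or footnotes in one sheet would fail a check like `is_uniform`, which is the situation the non-uniform CSV work targets.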
10. Method 2: Schema Model
The model MUST:
• Represent a single value or a group of values (cell/row/column)
• Capture hidden relations between values (properties)
• Specify templates and data for transformation and annotation
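The three requirements above can be pictured with a small data-structure sketch. It is purely illustrative and does not reproduce the CSV-X schema model: the class names `Cell` and `Region`, the `properties` dictionary, and the use of `string.Template` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass(frozen=True)
class Cell:
    """A single value together with its position in the sheet."""
    row: int
    col: int
    value: str


@dataclass
class Region:
    """A named group of cells (one cell, a row, a column, or a block)."""
    name: str
    cells: list
    # Relations that are implicit in the sheet, e.g. {"unit": "USD"}.
    properties: dict = field(default_factory=dict)
    # Output pattern; placeholders are filled per cell.
    template: str = ""

    def render(self):
        """Apply the template to every cell in the region."""
        tpl = Template(self.template)
        return [
            tpl.substitute(self.properties, row=c.row, col=c.col, value=c.value)
            for c in self.cells
        ]


# Hypothetical grid: the unit of the "Sales" column is not stated in the data.
grid = [["Year", "Sales"], ["2015", "120"], ["2016", "150"]]
sales = Region(
    name="sales",
    cells=[Cell(r, 1, grid[r][1]) for r in range(1, len(grid))],
    properties={"unit": "USD"},
    template="row $row: $value $unit",
)
lines = sales.render()
print(lines)
```

Keeping the property and the template on the group rather than on each cell means one declaration can annotate a whole column.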
Schema Language, Model, and Processing Engine for Non-
• Describe patterns and relations using flexible schema
constructs, an adaptive matching algorithm, and cross
Features: Parsing, Annotation, Alteration, Validation, Cross-Referencing,
automatic RDF Serialization, and Template-based Transformation
• Implemented in Java with 3K LOC (w/o library, comment, and
• Evaluation focused on expressivity and functional
1. Seven real-world, complex non-uniform CSV files from the W3C
CSVW use cases report and from US, UK, Japanese,
and Thai open data sites
2. Identify non-uniform patterns and compose the
3. Test operations: parsing, annotation, validation,
cross-referencing, and RDF transformation
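Three of the tested operations (parsing, validation, and RDF transformation) can be sketched on a toy input as follows. This is not the CSV-X engine, and it skips annotation and cross-referencing. The `BASE` namespace, the function names, and the sample data are hypothetical, and the output is plain N-Triples with every value emitted as a string literal.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for this sketch


def parse(text):
    """Read CSV text into a list of dicts keyed by the header row."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def validate(records, column, check):
    """Return the indices of records whose `column` value fails `check`."""
    return [i for i, rec in enumerate(records) if not check(rec[column])]


def escape(literal):
    """Escape a string for use inside an N-Triples literal."""
    return (literal.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(records, key):
    """Emit one triple per non-key cell, using `key` to build the subject."""
    lines = []
    for rec in records:
        subject = f"<{BASE}row/{quote(rec[key], safe='')}>"
        for col, val in rec.items():
            if col == key:
                continue
            predicate = f"<{BASE}prop/{quote(col, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape(val)}" .')
    return lines


SAMPLE = "id,name,age\r\n1,Ann,34\r\n2,Bob,n/a\r\n"
records = parse(SAMPLE)
bad_rows = validate(records, "age", str.isdigit)
triples = to_ntriples(records, "id")
print(bad_rows)
print("\n".join(triples))
```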
18. Results and Conclusion
• Our definitions of non-uniform CSV patterns cover
all patterns that appear in
• CSV-X can process a variety
of complex non-uniform
CSV files in real-world datasets
• We hope that CSV-X can help in publishing high-quality
data for the iSWoT and open data communities alike
19. References 1
“Linked Data - Design Issues.” [Online]. Available: https://www.w3.org/DesignIssues/LinkedData.html. [Accessed: 05-Aug-2016].
 T. Lebo, J. S. Erickson, L. Ding, A. Graves, G. T. Williams, D. DiFranzo, X. Li, J.
Michaelis, J. G. Zheng, J. Flores, Z. Shangguan, D. L. McGuinness, and J. Hendler,
“Producing and Using Linked Open Government Data in the TWC LOGD Portal,” in
Linking Government Data, D. Wood, Ed. Springer New York, 2011, pp. 51–72.
“CSV Schema Language 1.1.” [Online]. Available: http://digital-preservation.github.io/csv-schema/csv-schema-1.1.html. [Accessed: 09-Jul-2016].
“2014: The Year of CSV | News,” Open Data Institute. [Online]. Available: https://theodi.org/blog/2014-the-year-of-csv. [Accessed: 15-Jul-2016].
 T. Davies, R. M. Sharif, and J. M. Alonso, “Open Data Barometer Global Report,”
World Wide Web Found., 2015.
20. References 2
 W. Martens, F. Neven, and S. Vansummeren, “SCULPT: A Schema Language for
Tabular Data on the Web,” in Proceedings of the 24th International Conference on
World Wide Web, New York, NY, USA, 2015, pp. 702–720.
“Model for Tabular Data and Metadata on the Web.” [Online]. Available: http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/. [Accessed: 29-Jul-2016].
 P. E. R. Salas, M. Martin, F. M. Da Mota, S. Auer, K. Breitman, and M. A.
Casanova, “Publishing statistical data on the web,” in Semantic Computing (ICSC),
2012 IEEE Sixth International Conference on, 2012, pp. 285–292.
A. Langegger and W. Wöss, “XLWrap–Querying and Integrating Arbitrary
Spreadsheets with SPARQL.”
 “What is a CSV? A case study of CSVs on data.gov.uk.” [Online]. Available:
21. References 3
Y. Shafranovich, “Common format and MIME type for comma-separated values (CSV)
 “JSON-LD 1.0.” [Online]. Available: https://www.w3.org/TR/json-ld/. [Accessed: 29-
 M. Compton, P. Barnaghi, L. Bermudez, R. García-Castro, O. Corcho, S. Cox, J.
Graybeal, M. Hauswirth, C. Henson, and A. Herzog, “The SSN ontology of the W3C semantic
sensor network incubator group,” Web Semant. Sci. Serv. Agents World Wide Web, 2012.
 J. Tandy, D. Ceolin, and E. Stephan, “CSV on the Web: Use cases and requirements,”
W3C Working Group Note, 25-Feb-2016. [Online]. Available: https://www.w3.org/TR/csvw-
 R. Cyganiak, D. Reynolds, and J. Tennison, “The RDF data cube vocabulary,” W3C
Recomm. January 2014, 2013.
 S. Auer, S. Dietzold, and T. Riechert, “OntoWiki–a tool for social, semantic
22. References 4
 C. C. Aggarwal, N. Ashish, and A. Sheth, “The internet of
things: A survey from the data-centric perspective,” in Managing
and mining sensor data, Springer, 2013, pp. 383–428.
 Guinard, D., Trifa, V., Mattern, F., & Wilde, E. (2011). From the
Internet of Things to the Web of Things: Resource-oriented
Architecture and Best Practices. In D. Uckelmann, M. Harrison, &
F. Michahelles (Eds.), Architecting the Internet of Things (pp. 97–
129). Springer Berlin Heidelberg. Retrieved from http://
D. Pfisterer et al., “SPITFIRE: toward a semantic web of
things,” IEEE Communications Magazine, vol. 49, no. 11, pp. 40–
• Human Localization Sensor Ontology: Enabling
OWL 2 DL-Based Search for User’s Location-Aware
Sensors in the IoT. In 2016 IEEE Tenth International
Conference on Semantic Computing (ICSC) (pp.
• CSV-X: A Linked Data Enabled Schema Language,
Model, and Processing Engine for Non-Uniform
CSV. To appear in Proceedings of the 2016 IEEE
International Conference on Smart Data
(SmartData), Dec 16-19, China