CSV-X is a schema language, model, and processing engine for non-uniform CSV enabling annotation, validation, cross-referencing, Linked Data, RDF serialization, and transformation to other formats.
CSV-X
1. CSV-X: A Linked Data Enabled
Schema Language, Model,
and Processing Engine
for Non-Uniform CSV
Wirawit Chaochaisit
Sakamura-Koshizuka Laboratory
Graduate School of Interdisciplinary Information Studies
University of Tokyo
2. The Era of Data
• Big data from:
• User-generated content (blogs, YouTube, Pinterest, etc.)
• Social Networks
• Open Government Data (data.gov, etc.)
• Mobile, Internet of Things, Sensors
6. Status of Today’s
Open Data Formats
• The most popular data formats in the world’s open data are still
tabular: xls and csv/tsv [5]. Over 90% of the data on data.gov.uk
is tabular [4].
• However, CSV is a very limited data format: no formal data
structure, no datatypes, no schema.
• Why is it still so widely used?
• Easy to produce with existing tools (relational databases, Excel, etc.)
• Creating XML or RDF costs more, both technically and
financially (even for US and UK government units) [2][3]
7. • There is a W3C standard, and many tools try to
upgrade CSV [2][3][6][7][8][9]
• However, most of these tools only support CSV as
defined by the IETF’s RFC 4180 memo [11]
• More than 40% of the CSV files were ignored in an ODI study
of all CSV on data.gov.uk [10]
ID, Name, Address, Remark CRLF
1, John Doe, "Main St.", CRLF
2, Clark Kent, 5th Ave., "foo CRLF
bar" CRLF
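To make the RFC 4180 shape concrete, the sample above can be read with Python's standard csv module. This is only a sketch of what a conventional parser handles, not part of CSV-X: the helper names parse_rfc4180 and is_uniform are hypothetical, and the spaces after the commas are dropped so that the quotes sit at the start of each field.

```python
import csv
import io

# The slide's sample: CRLF ends each record, and one quoted field
# contains an embedded CRLF. Spaces after commas are omitted here
# (an assumption) so the quoted fields start right after the comma.
SAMPLE = (
    'ID,Name,Address,Remark\r\n'
    '1,John Doe,"Main St.",\r\n'
    '2,Clark Kent,5th Ave.,"foo\r\nbar"\r\n'
)


def parse_rfc4180(text):
    """Parse RFC 4180 style text into a list of rows (lists of strings)."""
    # newline="" leaves line endings untranslated, so the CRLF inside
    # the quoted field is kept as part of the value.
    return list(csv.reader(io.StringIO(text, newline="")))


def is_uniform(rows):
    """True when every row has the same number of fields."""
    return len({len(row) for row in rows}) <= 1


rows = parse_rfc4180(SAMPLE)
print(rows)
print(is_uniform(rows))
```

A file with title rows, footnotes, or several tables stacked in one sheet would fail the is_uniform check, which is the kind of input the following slides are concerned with.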
8. Challenge in Non-Uniform CSV
How can we describe these random patterns so
that we can perform automatic processing?
10. Method 2: Schema Model
The model MUST:
• Represent a single value or a group of values (cell/row/column)
• Capture hidden relations between values (properties)
• Specify templates and data for transformation and annotation
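The three requirements above can be illustrated with a small data model. This is a minimal sketch under stated assumptions, not the CSV-X schema language itself: the class names Region, Property, and Schema, the template placeholders, and the sample table are all hypothetical.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Region:
    """A single cell or a rectangular group of cells (hypothetical name)."""
    name: str
    rows: range
    cols: range

    def values(self, table):
        return [table[r][c] for r in self.rows for c in self.cols]


@dataclass
class Property:
    """A named relation between the values of two regions."""
    name: str
    subject: str  # name of the region holding the subjects
    object: str   # name of the region holding the objects


@dataclass
class Schema:
    regions: dict = field(default_factory=dict)
    properties: list = field(default_factory=list)
    template: str = ""

    def relate(self, table, prop):
        """Pair subject and object values position by position."""
        subjects = self.regions[prop.subject].values(table)
        objects = self.regions[prop.object].values(table)
        return list(zip(subjects, objects))

    def render(self, table, prop):
        """Fill the template once per related pair."""
        tpl = Template(self.template)
        return [tpl.substitute(s=s, p=prop.name, o=o)
                for s, o in self.relate(table, prop)]


# A non-uniform table: a title row, then a header row, then data rows.
table = [
    ["Station report 2016"],
    ["id", "temp"],
    ["s1", "21.5"],
    ["s2", "19.0"],
]

schema = Schema(template="$s $p $o")
for region in (Region("title", range(0, 1), range(0, 1)),
               Region("ids", range(2, 4), range(0, 1)),
               Region("temps", range(2, 4), range(1, 2))):
    schema.regions[region.name] = region
has_temp = Property("hasTemperature", "ids", "temps")
schema.properties.append(has_temp)

pairs = schema.relate(table, has_temp)
lines = schema.render(table, has_temp)
print(lines)
```

Regions cover the cell/row/column requirement, Property names the relation that the layout only implies, and the template string stands in for the transformation target.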
11. CSV-X
Schema Language, Model, and Processing Engine for Non-Uniform CSV
• Describes patterns and relations using flexible schema
constructs, an adaptive matching algorithm, and
cross-referencing techniques
Features: parsing, annotation, alteration, validation, cross-referencing,
automatic RDF serialization, and template-based transformation
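Two of the listed features, validation and RDF serialization, can be sketched in a few lines. The code below is an illustration only and does not reproduce CSV-X's engine: the function names, the per-column regular expressions, and the http://example.org/ base IRI are assumptions, and identifiers are assumed to be safe to place inside an IRI.

```python
import re


def validate(rows, patterns):
    """Return (row, col, value) for each cell failing its column's regex.

    `patterns` maps a column index to a regular expression; columns
    without an entry are not checked.
    """
    errors = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            pattern = patterns.get(c)
            if pattern is not None and not re.fullmatch(pattern, value):
                errors.append((r, c, value))
    return errors


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(header, rows, base="http://example.org/"):
    """Emit one N-Triples line per non-key cell; column 0 is the row key."""
    lines = []
    for row in rows:
        subject = f"<{base}row/{row[0]}>"
        for name, value in zip(header[1:], row[1:]):
            predicate = f"<{base}prop/{name}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


header = ["id", "temp"]
rows = [["s1", "21.5"], ["s2", "n/a"]]

# Column 1 must look like a decimal number.
errors = validate(rows, {1: r"-?\d+(\.\d+)?"})
bad_rows = {r for r, _, _ in errors}
triples = to_ntriples(header, [row for i, row in enumerate(rows)
                               if i not in bad_rows])
print(errors)
print(triples)
```

Here only rows that pass validation are serialized; a real pipeline might instead annotate the failing cells and keep them.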
16. Implementation
• Implemented in Java in 3K LOC (excluding libraries, comments, and
blank lines)
Demo: http://www.dadfha.com:3232
Github: https://github.com/nabito/csv-x
17. Evaluation
• The evaluation focused on expressivity and functional
validation:
1. Collect 7 real-world complex non-uniform CSV files from the W3C
CSVW use cases report [14] and from US, UK, Japanese,
and Thai open data sites
2. Identify the non-uniform patterns and compose the
schema
3. Test the operations: parsing, annotation, validation,
cross-referencing, and RDF transformation
18. Results and Conclusion
• Our definitions of non-uniform CSV patterns cover
all patterns that appear in the sample datasets
• CSV-X can process a variety of complex non-uniform
CSV in real-world datasets
• It is hoped that CSV-X can help in publishing high-quality
data for the iSWoT and open data communities alike
19. References 1
[1] “Linked Data - Design Issues.” [Online]. Available:
https://www.w3.org/DesignIssues/LinkedData.html. [Accessed: 05-Aug-2016].
[2] T. Lebo, J. S. Erickson, L. Ding, A. Graves, G. T. Williams, D. DiFranzo, X. Li, J.
Michaelis, J. G. Zheng, J. Flores, Z. Shangguan, D. L. McGuinness, and J. Hendler,
“Producing and Using Linked Open Government Data in the TWC LOGD Portal,” in
Linking Government Data, D. Wood, Ed. Springer New York, 2011, pp. 51–72.
[3] “CSV Schema Language 1.1.” [Online]. Available:
http://digital-preservation.github.io/csv-schema/csv-schema-1.1.html. [Accessed: 09-Jul-2016].
[4] “2014: The Year of CSV | News,” Open Data Institute. [Online]. Available:
https://theodi.org/blog/2014-the-year-of-csv. [Accessed: 15-Jul-2016].
[5] T. Davies, R. M. Sharif, and J. M. Alonso, “Open Data Barometer Global Report,”
World Wide Web Found., 2015.
20. References 2
[6] W. Martens, F. Neven, and S. Vansummeren, “SCULPT: A Schema Language for
Tabular Data on the Web,” in Proceedings of the 24th International Conference on
World Wide Web, New York, NY, USA, 2015, pp. 702–720.
[7] “Model for Tabular Data and Metadata on the Web.” [Online]. Available:
http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/. [Accessed: 29-Jul-2016].
[8] P. E. R. Salas, M. Martin, F. M. Da Mota, S. Auer, K. Breitman, and M. A.
Casanova, “Publishing statistical data on the web,” in Semantic Computing (ICSC),
2012 IEEE Sixth International Conference on, 2012, pp. 285–292.
[9] A. Langegger and W. Wöß, “XLWrap–Querying and Integrating Arbitrary
Spreadsheets with SPARQL.”
[10] “What is a CSV? A case study of CSVs on data.gov.uk.” [Online]. Available:
https://theodi.github.io/blog/2014/02/18/the-status-of-csvs-on-datagovuk/.
[Accessed: 09-Jul-2016].
21. References 3
[11] Y. Shafranovich, “Common format and MIME type for comma-separated values (CSV)
files,” 2005.
[12] “JSON-LD 1.0.” [Online]. Available: https://www.w3.org/TR/json-ld/.
[Accessed: 29-Jul-2016].
[13] M. Compton, P. Barnaghi, L. Bermudez, R. García-Castro, O. Corcho, S. Cox, J.
Graybeal, M. Hauswirth, C. Henson, and A. Herzog, “The SSN ontology of the W3C semantic
sensor network incubator group,” Web Semant. Sci. Serv. Agents World Wide Web, 2012.
[14] J. Tandy, D. Ceolin, and E. Stephan, “CSV on the Web: Use cases and requirements,”
W3C Working Group Note, 25-Feb-2016. [Online]. Available:
https://www.w3.org/TR/csvw-ucr/.
[15] R. Cyganiak, D. Reynolds, and J. Tennison, “The RDF data cube vocabulary,” W3C
Recomm. January 2014, 2013.
[16] S. Auer, S. Dietzold, and T. Riechert, “OntoWiki–a tool for social, semantic
collaboration,”
22. References 4
[17] C. C. Aggarwal, N. Ashish, and A. Sheth, “The internet of
things: A survey from the data-centric perspective,” in Managing
and mining sensor data, Springer, 2013, pp. 383–428.
[18] D. Guinard, V. Trifa, F. Mattern, and E. Wilde, “From the
Internet of Things to the Web of Things: Resource-oriented
Architecture and Best Practices,” in Architecting the Internet of
Things, D. Uckelmann, M. Harrison, and F. Michahelles, Eds.
Springer Berlin Heidelberg, 2011, pp. 97–129. [Online]. Available:
http://link.springer.com/chapter/10.1007/978-3-642-19157-2_5
[19] D. Pfisterer et al., “SPITFIRE: toward a semantic web of
things,” IEEE Communications Magazine, vol. 49, no. 11, pp. 40–48,
2011.
23. Publications
• Human Localization Sensor Ontology: Enabling
OWL 2 DL-Based Search for User’s Location-Aware
Sensors in the IoT. In 2016 IEEE Tenth International
Conference on Semantic Computing (ICSC) (pp.
107–111). https://doi.org/10.1109/ICSC.2016.31
• CSV-X: A Linked Data Enabled Schema Language,
Model, and Processing Engine for Non-Uniform
CSV. To appear in the proceedings of the 2016 IEEE
International Conference on Smart Data
(SmartData), Dec 16-19, China