Inspired by trying to get up to speed with a shiny new project. Anything data-centric, whether machine learning or SQL, needs data.
I work for GridGain, the company that donated Ignite to the ASF.
Have you heard of Apache Ignite or GridGain?
GridGain Systems donated the code to the Apache Ignite project. It became a top-level project of the Apache Software Foundation (ASF) in 2014, the second fastest to do so. Apache Ignite is now one of the top five Apache Software Foundation projects, and has been for two years. It’s the most active in-memory computing project right now, used by thousands of companies worldwide.
GridGain is the only commercially supported version. It adds the integration, security, deployment, management and monitoring capabilities to core Ignite that business-critical applications need. We also provide global support and services, and we continue to be the biggest contributor to Ignite.
[1] http://globenewswire.com/news-release/2019/07/09/1534470/0/en/The-Apache-Software-Foundation-Announces-Annual-Report-for-2019-Fiscal-Year.html
[2] https://blogs.apache.org/foundation/entry/apache-in-2017-by-the
You are probably relying on us for some part of your personal or professional life.
We have several of the top 20 banks and wealth management companies as customers. If you include FinTech, 48-50 of the world’s largest banks use us indirectly, through Finastra.
Some of the leading software companies rely on us for their speed and scale. Microsoft uses us for real-time cloud security detection. Workday used us to get the scale they needed to sell to Walmart, and then to be able to run their software on Amazon, for Amazon.
There are some very large retail/e-commerce companies, including PayPal, HomeAway and Expedia.
And several innovators across FinTech, adTech, IoT and other areas.
Traditional databases don’t scale: you buy bigger and bigger boxes until you run out of money.
Traditional compute grids have to copy data across the network, which at modern scale is just impractical.
Ignite scales horizontally and sends compute to the data rather than the other way around.
In memory for speed. Disk persistence for volume.
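The disk-backed mode is Ignite’s native persistence, switched on per data region in the node configuration. A minimal sketch of the Spring XML, with everything else left at defaults (bean class names as in the Ignite docs):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <!-- Keep data in RAM and also write it through to disk -->
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```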
You fired up a node and you want to play… how do you load data?
Oracle has SQL*Loader. Most other legacy databases have something similar. Is there an Ignite equivalent?
Simple 14 point process
Okay, I’m being facetious.
That approach is good for production. For large volumes of data. For weird and wonderful data formats.
But what if you want to do something quickly, preferably without firing up an IDE?
Ignite supports ANSI-99 SQL…
Kind of like BULK INSERT in SQL Server.
Kind of like SQL*Loader in Oracle.
Good news: built-in
Bad news: only works for CSV
Basically zero configuration
sqlline -u jdbc:ignite:thin://127.0.0.1
0: jdbc:ignite:thin://127.0.0.1> COPY FROM 'file.csv' INTO tablename (col1, col2) FORMAT CSV;
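The target table has to exist first, with columns matching the file. A fuller round trip might look like this (the table, columns and file name here are made up for illustration):

```sql
CREATE TABLE person (id INT PRIMARY KEY, name VARCHAR);
COPY FROM 'person.csv' INTO person (id, name) FORMAT CSV;
SELECT COUNT(*) FROM person;
```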
Which means you end up using horrible command-line tricks to convert other formats into CSV. Here we’re using jq to convert from JSON to CSV:
jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' < file.json > file.csv
Python
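For plain Python without Spark, one option is the pyignite thin client: read the JSON yourself and generate a parameterized INSERT per record. A sketch, assuming a flat JSON array; the `bookmarks` table name and the `json_to_inserts` helper are mine, not an official API:

```python
import json

def json_to_inserts(table, records):
    """Turn a list of flat JSON objects into (sql, args) pairs,
    one parameterized INSERT per record."""
    stmts = []
    for rec in records:
        cols = ", ".join(rec)                  # column list from the keys
        marks = ", ".join("?" for _ in rec)    # one placeholder per value
        stmts.append((f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                      list(rec.values())))
    return stmts

records = json.loads('[{"href": "http://example.com", "tag": "demo"}]')
stmts = json_to_inserts("bookmarks", records)

# With a node listening on the default thin-client port, pyignite
# (pip install pyignite) could then execute them -- untested sketch:
#
#   from pyignite import Client
#   client = Client()
#   client.connect('127.0.0.1', 10800)
#   for sql, args in stmts:
#       client.sql(sql, query_args=args)
```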
Spark – kind of cheating
Start pyspark with a bunch of extra libraries so that it also understands Ignite. This command line is optimized for typing; you could instead list only the jars you need and load less code into memory.
bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,$IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar
In one line we read a JSON file
It understands the structure of the file – no further coding
Filter rows, drop columns, and so on, in a functional style.
b = spark.read.format('json').load('filename.json')
b.filter('href is not null') \
 .drop('hash', 'meta') \
 .write.format('ignite') \
 .option('config', 'default-config.xml') \
 .option('table', 'bookmarks') \
 .option('primaryKeyFields', 'href') \
 .mode('overwrite') \
 .save()
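For what it’s worth, the filter-and-drop step is doing nothing Ignite-specific. In plain Python over a list of dicts it would look like this (field names taken from the pyspark example, sample data made up):

```python
bookmarks = [
    {"href": "http://a.example", "hash": "x1", "meta": "m1"},
    {"href": None, "hash": "x2", "meta": "m2"},
]

# Keep rows with a non-null href, then drop the hash and meta columns
cleaned = [
    {k: v for k, v in row.items() if k not in ("hash", "meta")}
    for row in bookmarks
    if row["href"] is not None
]
```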