Plenty of public datasets are available, and their number is growing. A few of the most recent and useful tools in the Big Data ecosystem are showcased: Apache Zeppelin (incubating), Apache Spark and Juju.
2. Software Engineer at NFLabs, Seoul,
South Korea
Co-organizer of SeoulTech Society
Committer and PPMC member of
Apache Zeppelin (Incubating)
@seoul_engineer
github.com/bzz
Alexander
4. CONTEXT
Even public data is huge and growing
There could be more research, applications and
data products built using that data
The quality and number of free tools available to
the public for crunching that data are constantly improving
The cloud brings affordable computation at scale
19. TOOLS OVERVIEW
Generic:
Grep, Python, Ruby, JVM - all good, but hard to
scale beyond a single machine or data format
High-performance:
MPI, Hadoop, HPC - awesome but complex, not
easy, problem specific, not very accessible
New, scalable:
Spark, Flink, Zeppelin - easy (not simple) and
robust
21. Apache Software Foundation
1999 - 21 founders of 1 project
2016 - 9 members of the Board of Directors
600 Foundation members
2000+ committers of 171 projects (+55 incubating)
Keywords: meritocracy, community over code, consensus
Provides: infrastructure, legal support, a way of building
software
http://www.apache.org/foundation/
22. Apache Spark
Scala, Python, R
Apache Zeppelin
Modern Web GUI, plays nicely with Spark, Flink,
Elasticsearch, etc. Easy to set up.
Warcbase
Spark library for saved crawl data (WARC)
Juju
Scales, integration with Spark, Zeppelin, AWS, Ganglia
NEW, SCALABLE TOOLS
23. APACHE SPARK
From UC Berkeley AMPLab, since 2010
Databricks founded in 2013; Apache top-level
project since 2014
1000+ contributors
REPL + Java, Scala, Python, R APIs
http://spark.apache.org
24. APACHE SPARK
Has much more: GraphX, MLlib, SQL
https://spark.apache.org/examples.html
Parallel collections API (similar to FlumeJava, Crunch, Cascading)
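As a rough illustration of the parallel collections style, here is a minimal word-count sketch in Python. The names `tokenize`, `word_count_local` and `word_count_spark` are hypothetical and introduced only for this example. The Spark variant assumes an already-created `SparkContext` is passed in as `sc` and chains the RDD calls `textFile`, `flatMap`, `map` and `reduceByKey`; the local variant computes the same counts on a single machine for comparison.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters, digits or apostrophes.
WORD = re.compile(r"[a-z0-9']+")


def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return WORD.findall(line.lower())


def word_count_local(lines):
    """Single-machine reference: count words with a plain Counter."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def word_count_spark(sc, path):
    """Same computation expressed as an RDD pipeline.

    Assumes `sc` is an existing pyspark SparkContext and `path` points to
    a text file the cluster can read (both are assumptions of this sketch).
    Returns an RDD of (word, count) pairs.
    """
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
```

The pipeline reads like code over an ordinary collection, while Spark decides how to split the work across machines. Calling `word_count_spark(sc, path).collect()` would bring the pairs back to the driver.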
25. • Notebook-style GUI on top of a backend
processing system
• Plays nicely with the whole ecosystem: Spark,
Flink, SQL, Elasticsearch, etc.
• Easy to set up
APACHE ZEPPELIN (INCUBATING)
http://zeppelin.incubator.apache.org
30. Spark library for WARC (Web ARChive) data processing
* text analysis
* site link structure
WARCBASE
https://github.com/lintool/warcbase
http://lintool.github.io/warcbase-docs
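To make "site link structure" concrete, the sketch below counts links between hosts using only the Python standard library. It is not the Warcbase API: `LinkExtractor` and `site_links` are hypothetical names, and the input is assumed to be (page URL, HTML text) pairs that have already been read out of a WARC file by some other means.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def site_links(pages):
    """Count links between hosts.

    `pages` is an iterable of (page_url, html_text) pairs standing in for
    records extracted from a WARC file (reading WARC is not shown here).
    Returns a Counter keyed by (source_host, target_host).
    """
    edges = Counter()
    for page_url, html_text in pages:
        parser = LinkExtractor()
        parser.feed(html_text)
        source = urlparse(page_url).netloc
        for href in parser.hrefs:
            # Resolve relative links against the page URL.
            target = urlparse(urljoin(page_url, href)).netloc
            if target:  # skip links without a host, e.g. mailto:
                edges[(source, target)] += 1
    return edges
```

On a real crawl the same per-page extraction would run inside a distributed job, with the per-host counts merged at the end.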
31. Service modeling at scale
Deployment and configuration automation
+ Integration with Spark, Zeppelin, Ganglia, etc
+ AWS, GCE, Azure, LXC, etc
JUJU
https://jujucharms.com/