A quick introduction to Apache Spark and how it fits in the cognitive world: how we can use it to support cognitive solutions and to build distributed algorithms for prediction and other machine learning tasks.
2. Agenda
What is Spark?
Spark Libraries and Architecture
Spark role in the Cognitive world
Introducing Data Science Experience
How we are using Spark at Cognitive@IBM - Brazil
3. What is Spark?
Spark is a framework, a set of APIs and a parallel engine;
Created at the AMPLab (UC Berkeley);
Developed in Scala (GitHub: https://github.com/apache/spark);
Processes data from almost any source or format (text files, Parquet, Avro, databases, HDFS, S3, Object Storage, etc.);
Programs can be written in Java, Python or Scala;
Keeps data in memory (RAM) for fast processing.
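As a rough illustration of the parallel engine and the in-memory processing described above, here is a minimal PySpark sketch. It is not taken from the original slides: the function names, the application name and the local master setting are assumptions made for the example, and running `word_stats` requires a local PySpark installation.

```python
def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_stats(lines):
    """Count words in parallel and reuse the cached result for two actions."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")             # assumption: local mode, all cores
             .appName("spark-intro-sketch")  # hypothetical application name
             .getOrCreate())
    try:
        counts = (spark.sparkContext.parallelize(list(lines))
                  .flatMap(tokenize)
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .cache())                  # ask Spark to keep the result in memory
        distinct_words = counts.count()      # first action computes and caches
        top_words = counts.takeOrdered(3, key=lambda kv: -kv[1])  # reuses the cache
        return distinct_words, top_words
    finally:
        spark.stop()
```

Calling `word_stats(["Spark is fast", "Spark is parallel"])` would return the number of distinct words and the three most frequent ones; the second action reads the cached counts instead of recomputing them.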
6. Spark Role in the Cognitive World
Predictions
Natural Language Processing
Watson Integration
Cognitive Solutions Integrator
Cognitive Decisions in Real Time
with Watson Explorer
Unstructured Data Processing
8. How we use Spark
Environment:
◦ Developing and testing on Data Science Experience;
◦ Created our own standalone cluster with 7 workers for production, running on SoftLayer;
◦ Created an auto-scaling standalone cluster using Docker containers on Bluemix;
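To make the standalone cluster setup above more concrete, the sketch below shows one way an application could point at a standalone master. The host name, memory and core values are hypothetical and are not the settings of the cluster described in these slides; `spark://HOST:7077` is the usual master URL form for standalone mode.

```python
def standalone_conf(master_host, port=7077, executor_memory="2g", max_cores=4):
    """Build Spark settings for an application on a standalone cluster."""
    return {
        "spark.master": f"spark://{master_host}:{port}",
        "spark.executor.memory": executor_memory,  # memory per executor
        "spark.cores.max": str(max_cores),         # total cores for this application
    }


def build_session(conf, app_name="cluster-sketch"):
    """Create a SparkSession from a dict of settings (requires PySpark)."""
    # Imported here so standalone_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)  # hypothetical app name
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, `build_session(standalone_conf("spark-master.example.com"))` would try to connect to a master at that (made-up) address.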
Processing:
◦ Environment for fast clustering and testing new algorithms;
◦ Moving structured and unstructured data from different databases;
◦ Data cleaning;
◦ Speeding up ETL processes;
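The processing tasks above (moving data out of a database, cleaning it, speeding up ETL) could look roughly like the following PySpark sketch. The table, column and path names are placeholders, the cleaning rules are only an example, and the matching JDBC driver is assumed to be available to Spark.

```python
def jdbc_options(url, table, user, password):
    """Collect the options Spark's JDBC data source expects."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def clean_and_store(spark, options, key_columns, output_path):
    """Read a table over JDBC, clean it, and write it out as Parquet."""
    # Imported here so jdbc_options can be used without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.read.format("jdbc").options(**options).load()

    # Trim surrounding whitespace in every string column.
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))

    cleaned = (df.na.drop(subset=key_columns)   # drop rows with a missing key
                 .dropDuplicates(key_columns))  # keep one row per key
    cleaned.write.mode("overwrite").parquet(output_path)
    return cleaned
```

A call might look like `clean_and_store(spark, jdbc_options("jdbc:postgresql://db.example.com/sales", "orders", "etl", "secret"), ["order_id"], "/data/orders_clean")`, where every value is illustrative.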
9. Resources
My article talking about Spark
◦ https://w3-connections.ibm.com/blogs/af5593c1-5dae-421e-87d6-6ac263973790/entry/Spark_what_is_that?lang=en_us
My GitHub repository on how to create and run a Spark standalone cluster using Docker containers on Bluemix
◦ https://github.com/brunocfnba/docker-spark-cluster
Big Data Analysis with Apache Spark course (free, but with fixed enrollment periods)
◦ https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x
Apache Spark web site
◦ http://spark.apache.org/