Python in big data ecosystem by Nicholas Lu

Python in Big Data
Ecosystem
Nicholas Lu (Chee Seng)
PyCon Malaysia 2017

About me:
Physics and Mathematics major. ETL developer for Warner Chappell. Glowing
passion for the yellow elephant (Hadoop) ecosystem. A pip and apt-get guy.
Uses vim and tabs.
github.com/lucheeseng827

Why do we need Py
in the Tonne
A world of heavy JVM and low-level languages: performance vs
simplicity
The total volume of data is immense
RAM is getting cheaper
Less code = fewer errors = less development time

1. Intro to SMACK
➔ Spark
In-memory processing makes things
faster and more efficient.
➔ Kafka
How many types of straws are we using
to drain a water tank?
➔ Cassandra
Storing data across multiple computers
makes it faster.

➔ Mesos
Containerized environment for
ease of scalability and
management.
➔ Akka
High concurrency for better utilization
and effective processes
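
The "straws" and "multiple computers" ideas above both come down to splitting records by key so that several workers can handle them in parallel. The following is a minimal sketch of that idea only, not how Kafka or Cassandra do it internally: CRC32 is an assumed stand-in for their own partitioners, and the names partition_for and spread are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def spread(keys, num_partitions):
    """Group keys by the partition index they land on."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition_for(key, num_partitions)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical keys; the same key always lands on the same partition.
    print(spread(["customer-1", "customer-2", "finance-1", "finance-2"], 3))
```

Because the hash is stable, every record with the same key goes to the same partition, which is what lets more straws (or more nodes) share the work without losing per-key ordering.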

Python in everything
Python Package                         | SMACK     | Big Guy Developer
Kafka Confluent, Pykafka, Python-Kafka | Kafka     | Confluent
Pyspark                                | Spark     | Databricks
Mesos-python                           | Mesos     | Mesosphere
Cassandra-driver                       | Cassandra | DataStax
Pykka                                  | Akka      | Unknown

http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html

https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka

Dealing with mixed volume and velocity
Doing ETL/ELT (fixed schedule, moving data around)
Prefer speedy micro-batches over classic batch processing (seconds vs
minutes)
Plan to add more features over time
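
To make the micro-batch idea concrete, here is a minimal pure-Python sketch that cuts a stream of records into small batches and processes each one as it fills. It is a simplification: Spark Streaming cuts batches by time interval, while this sketch cuts by record count. The names micro_batches and total_amount, and the "amount" field, are hypothetical.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_amount(batch):
    """Example per-batch computation: sum a hypothetical 'amount' field."""
    return sum(row["amount"] for row in batch)


if __name__ == "__main__":
    # A generator stands in for an unbounded stream of rows.
    rows = ({"id": i, "amount": i * 10} for i in range(7))
    for batch in micro_batches(rows, 3):
        print(len(batch), total_amount(batch))
```

Each small batch is available for downstream use as soon as it is complete, instead of waiting for the whole input to arrive as in a classic batch job.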

Tools Needed
Python
Docker (Kafka, Cassandra)
Spark
Some big flat files
A computer with a decent amount of RAM

2. Flow
Sequence for the data processing flow
➔ Pipe them in
Show me the data.
➔ Collect and Subscribe
Customer data in channel 4 and
Finance in channel 2
➔ Process in Batch
Release the Kraken!
➔ Process On-The-Go
Near real time processing for higher
urgency
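
The first two steps of the flow (pipe rows in, then route them to per-subject channels) can be sketched as below. This is an illustration under stated assumptions, not the demo's actual code: the topic names, the "kind" column, and the helpers encode_row, topic_for, and pipe_in are hypothetical, and ListProducer is an in-memory stand-in so the sketch runs without a broker. A real kafka-python KafkaProducer could be passed in instead, since it also exposes send(topic, value) and takes bytes values by default.

```python
import csv
import io
import json


def encode_row(row: dict) -> bytes:
    """Serialize one row as UTF-8 JSON bytes."""
    return json.dumps(row, sort_keys=True).encode("utf-8")


def topic_for(row: dict) -> str:
    """Pick a channel per row (hypothetical topic names and 'kind' column)."""
    return "finance" if row["kind"] == "finance" else "customer"


def pipe_in(lines, producer, choose_topic):
    """Read CSV lines and send each row to its topic; return the row count."""
    sent = 0
    for row in csv.DictReader(lines):
        producer.send(choose_topic(row), encode_row(row))
        sent += 1
    return sent


class ListProducer:
    """In-memory stand-in for a producer; records (topic, value) pairs."""

    def __init__(self):
        self.messages = []

    def send(self, topic, value):
        self.messages.append((topic, value))


if __name__ == "__main__":
    sample = io.StringIO(
        "kind,name,amount\ncustomer,alice,10\nfinance,invoice-7,250\n"
    )
    producer = ListProducer()
    print(pipe_in(sample, producer, topic_for))
    print(producer.messages)
```

Keeping the producer as a parameter means the same pipe_in function can be exercised locally with the stand-in and pointed at a broker later.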

https://intellipaat.com/tutorial/cassandra-tutorial/brief-architecture-of-cassandra/

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html

Then, Marcos discovered
SMACK
His interest in Python is
completely revived.
He's able to give every project a great
SMACK: projects that provide clients with
fast analytics at scale.

What’s next?
Flink implementation
Apache Beam implementation
ML implementation
Implementation of Caching
DataFrames
SQL in Spark Streaming
DC/OS (multi-cloud tenancy)
Many more

Thank you!
To help make this demo better,
please send feedback to
lu.cheeseng827@gmail.com
