It's one of the assignments. (The collaborative filtering one)
There is an edX course running that covers Machine Learning with Python, though it does require "...familiarity with basic machine learning concepts".
"All exercises will use PySpark, but previous experience with Spark or distributed computing is NOT required. "
I am doing this course and find it really good: https://www.edx.org/course/scalable-machine-learning-uc-berk...
It covers building linear and logistic regression models plus PCA using Spark's Python API.
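For a sense of what those assignments compute, here is a minimal plain-NumPy sketch of least-squares linear regression and PCA. This is only an illustration of the math; the course itself has you implement these over distributed Spark RDDs, not with `numpy.linalg` calls.

```python
import numpy as np

def linear_regression_fit(X, y):
    """Ordinary least squares via the normal equations: solve (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def pca(X, k):
    """Top-k principal directions of mean-centered data, via SVD."""
    Xc = X - X.mean(axis=0)          # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                    # rows are unit-length principal directions

# Tiny demo: data generated from y = 2*x0 + 3*x1 is recovered exactly.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
w = linear_regression_fit(X, y)      # -> approximately [2.0, 3.0]
```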
For anyone who wants to pick up Spark basics: Berkeley (Spark was developed at Berkeley's AMPLab), in collaboration with Databricks (the commercial company started by Spark's creators), just started a free MOOC on edX: https://www.edx.org/course/introduction-big-data-apache-spar...
(If you're wondering what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine that can run on top of Hadoop, usually performs much better, and has an arguably much easier API in Python, Scala, Java and now R.)
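To give a feel for that API, below is a plain-Python emulation of Spark's canonical word-count pipeline. The `flat_map` and `reduce_by_key` helpers are local stand-ins I wrote to mirror the RDD operations, not Spark APIs; in real PySpark the same shape would be `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add).collect()`.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def flat_map(f, xs):
    """Apply f to each element and flatten the results (mimics RDD.flatMap)."""
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    """Group (key, value) pairs by key and fold the values (mimics RDD.reduceByKey)."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(f, vs) for k, vs in grouped.items()}

lines = ["spark is fast", "spark is easy"]
words = flat_map(str.split, lines)     # flatMap: split each line into words
pairs = [(w, 1) for w in words]        # map: pair every word with a count of 1
counts = reduce_by_key(add, pairs)     # reduceByKey: sum the counts per word
# counts == {'spark': 2, 'is': 2, 'fast': 1, 'easy': 1}
```

The point of the functional style is that each stage is an independent transformation, which is what lets Spark distribute the same pipeline across a cluster.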
It has more than 5000 students so far, and the professor seems to answer every single question on Piazza (a popular student/teacher message board).
So far it looks really good. (It started a week ago, so you can still catch up; the 2nd lab is only due Friday 6/12 EOD, you have a 3-day "grace" period on top of that, and there is not too much to catch up on.)
I use Spark for work (Scala API) and still learned one or two new things.
It uses the PySpark API, so there is no need to learn Scala. All homework labs are done in an IPython notebook. Very high quality so far, IMHO.
It is followed by a more advanced Spark course (Scalable Machine Learning), also by Berkeley & Databricks.
(Not affiliated with edX, Berkeley or Databricks, just thought it's a good place for a PSA to those interested.)
The academic paper that introduced Spark, by Matei Zaharia (creator of Spark), won him the ACM Doctoral Dissertation Award in 2014 ( http://www.acm.org/press-room/news-releases/2015/dissertatio... )
Spark also set a new record in large-scale sorting (beating the previous Hadoop result by a wide margin): https://databricks.com/blog/2014/11/05/spark-officially-sets...
* EDIT: typo in "Berkeley", thanks gboss for noticing :)
Do they really get better? I'm either going to jump straight to the (R) Statistical Inference course from JHU, or switch to the Berkeley/edX Spark course.
I use a lot more Spark in my day job than R, but I really should learn statistics more formally.