Bill Howe did a solid intro course for the University of Washington. Videos and other materials are available on Coursera.
The one thing I'd really change is to tighten up the range of tools used. It seems helpful to show students a range of tools, but it usually ends up being a major distraction for students and a lot of extra effort for course staff. Any such course is already going to be a blitz of new concepts and technology.
Go full Python, plus interactive tools as helpful (Weka, Tableau). Let them pick up R or D3.js or whatever later, after they have a better appreciation for the concepts that make those tools useful.
So looking through this 'track', I see one course which seems like it might be more central to the discipline, "Intro to Data Science". Has anybody had a chance to compare this one against Bill Howe's "Introduction to Data Science" on Coursera?
"A lot of people fail to understand the overheads and limitations of this kind of architecture. Or how hard it is to program, especially considering salaries for this skyrocketed. More often than not a couple of large 1TB SSD PCIe and a lot of RAM can handle your "big" data problem."
It's not that hard to program... it does take a shift in how you attack problems.
If your data set fits on a few SSDs, then you probably don't have a real big data problem.
"Moving Big Data around is hard. Managing is harder."
Moving big data around is hard--that's why you have Hadoop--you send the compute to where the data is, which requires a new way of thinking about how you do computations.
"Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci"
Data science does not solve the big data problem. Here's my favorite definition of a big data problem: "a big data problem is when the size of the data becomes part of the problem." You can't use traditional sequential programming models to handle a true big data problem; you have to have some strategy to parallelize the compute. Hadoop is great for that.
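To make the "parallelize the compute" point concrete, here is a rough sketch of the map/reduce idea in plain Python. It is not Hadoop's API; the function names and the toy partitions are made up for illustration. Each partition stands in for a block of data sitting on a different node: the map step runs on each partition independently, and only the small partial results get merged.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words inside one partition (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


def word_count(partitions):
    # On a real cluster each map_partition call would run on the node
    # that holds that partition; here they simply run one after another.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Toy data: two "nodes", each holding a few lines of text (assumed example).
partitions = [
    ["big data is big"],
    ["data moves slowly", "compute moves fast"],
]
totals = word_count(partitions)
```

The shift in thinking is that map_partition never needs to see the whole data set, so the work can follow the data instead of the data being hauled to one machine.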
"A large telco has a 600 node cluster of powerful hardware. They barely use it."
Sounds more like organizational issues, poor planning and execution than a criticism of Hadoop!
For an introduction to the broader realm of data input, normalization, modeling, and visualization -- in which ML plays but a part -- you can "preview" Bill Howe's "Introduction to Data Science" class on Coursera; I'm working through the lectures, and I find he gives compelling explanations of what all these parts are, why they're important, and how it all fits together in a larger context.
A large telco has a 600 node cluster of powerful hardware. They barely use it. Moving Big Data around is hard. Managing is harder.
A lot of people fail to understand the overheads and limitations of this kind of architecture. Or how hard it is to program, especially considering salaries for this skyrocketed. More often than not a couple of large 1TB SSD PCIe and a lot of RAM can handle your "big" data problem.
Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci
Seems to work great over here, and the installation was pretty easy, too. You can even choose not to download certain types of files using the -n option. For example, if you have a large hard drive and a smaller one, you can download the whole course to the large HD:
coursera-dl -u username -p password -d pathToLargeHD course_name
and only download PDF lecture notes to the smaller one:
coursera-dl -u username -p password -d pathToSmallHD -n mp4,pptx course_name
I tried that over here, worked great.
Some schools prefer that students don't download course materials. I successfully downloaded the Machine Learning and Algorithms courses from Stanford but could not download this one; it says "now downloadable content found":
This seems to fit the bill: https://www.coursera.org/course/datasci
Both in timing and in content this could be a good lead-in to the University of Washington's Intro to Data Science class that looks like it will have more of a focus on 'big data', NoSQL, Hadoop, data mining, etc.